Considerations of variability and power for long-term monitoring of stream fish assemblages

Little attention has been given to optimizing statistical power for monitoring stream fish assemblages. We explored the relationship between temporal variability and statistical power using 34 metrics from fish community data collected annually at six sites over 10 years via electrofishing. Metric variability differed by the life stage and group of species considered, use of abundance or mass data, and data standardization technique. Lower variability was associated with community data, abundance data, and time-based standardizations, while greater variability was associated with young-of-the-year data, mass data, and area-based standardizations. Simulation-based power analysis indicated that metric choice and, to a lesser degree, monitoring design (annual, biennial, endpoints, or haphazard sampling) influenced power to detect change. For a fixed number of surveys (N = 60), endpoints sampling performed best. The N needed to detect change was heavily dependent upon metric choice for all monitoring designs, with the most biologically specific metrics requiring greater N. Large savings in effort and resource expenditure can be obtained by utilizing biologically relevant metrics that are robust to temporal noise within an appropriate sampling design.


Introduction
A great deal of time and resources have been devoted to environmental monitoring over the past few decades, yet notably less effort has been devoted to understanding the ability of monitoring programs to detect meaningful changes in important physical, chemical, and biological indicators. There are numerous examples of long-term monitoring programs that have collected invaluable information that informed future policy and management actions (Lindenmayer and Likens 2018; Lovett et al. 2007; Sullivan et al. 2018). However, there is also a growing awareness that limited conservation resources need to be allocated judiciously and that not all monitoring efforts are adequately designed to accomplish program objectives (Lindenmayer and Likens 2010, 2018). The titles of publications such as "Why most conservation monitoring is, but need not be, a waste of time" (Legg and Nagy 2006), "Monitoring does not always count" (McDonald-Madden et al. 2010), and "Making monitoring meaningful" (Field et al. 2007) indicate the increasing concern in the scientific community about inadequacies in the current state of ecological monitoring. These papers identify common shortcomings in ecological monitoring and recommend a number of approaches to increase its value.
The consideration of statistical power is one of the central themes in most calls for improving ecological monitoring (Field et al. 2007; Legg and Nagy 2006; Lindenmayer and Likens 2010). Power is defined as the probability of detecting a given change in resource condition and is inversely related to the probability of making a type II error, i.e., erroneously concluding that a change in resource condition has not occurred when it has (Caughlan and Oakley 2001; Cohen 1992; Fairweather 1991). Thus, a study that lacks sufficient power is more prone to committing a type II error. This is an important consideration in conservation monitoring because critical management actions may be withheld under the false pretense that the resource has not been degraded (Caughlan and Oakley 2001; Fairweather 1991). Thus, a priori considerations of power are critical to determine the spatial and temporal sampling intensity necessary to achieve study objectives.
Long-term environmental monitoring programs in the Ashokan watershed and across the greater Catskill Mountains region of New York, USA, are important for several reasons. Reservoirs in this region provide 90% of the drinking water supply to more than 8 million residents of New York City (Palmer et al. 2008), while a number of streams and rivers in the region provide renowned trout fishing opportunities (Van Put 2007). Local economies in the Ashokan watershed depend heavily on the estimated 15 000 tubers and large but unknown number of fly anglers that frequent the upper Esopus Creek each year (CCE 2007a). Increasing demands on the funds allocated for environmental research and monitoring in this region and elsewhere, however, necessitate more efficient monitoring strategies. The primary objective of this paper is to help develop efficient monitoring strategies that assess short- and long-term changes in the condition of fish assemblages in the upper Esopus Creek and other watersheds where detecting change in resource condition is critical for sound management. Such efforts are essential to understand (i) the normal range of variability in these resources, (ii) the effects of current and future stressors, and (iii) management options that could best protect and sustain valuable fishery resources while balancing competing uses for water.
Researchers monitoring stream fish assemblages must make numerous decisions about what information to collect and how to utilize those data, yet the implications of these decisions for statistical power are often unknown. A number of fisheries studies have explored how different experimental designs (e.g., how many sites to sample, on what interval, how to select sites, etc.; Dauwalter et al. 2010; Urquhart and Kincaid 1999) and sampling techniques (Al-Chokhachy et al. 2009; Hanks et al. 2018) affect statistical power. However, beyond these broad decisions, researchers interested in determining whether a trout fishery is decreasing over time must also decide whether to estimate the size or mass of the population, whether to include fish of all year classes or just a subset, and how to standardize those data. These specific decisions are generally categorized as the "response design", or the process of deciding what to measure and how to measure it (Stevens and Urquhart 2000). The response design also has the potential to affect the statistical power of the monitoring program, yet only a limited amount of research has explored this topic. Dauwalter et al. (2009) considered differences in power between abundance and biomass measures, Al-Chokhachy et al. (2009) and Dauwalter et al. (2009) explored the effects of using different size or age classes of fish, and related work has explored the effects of data standardizations on power in headwater stream fish assemblages. Although additional information on these topics is needed, these studies clearly indicate that specific decisions related to the response design have the potential to affect statistical power.
In this study, we collected and analyzed a suite of fish community data from the Ashokan watershed to explore the relationship between interannual (hereinafter "temporal") variability and statistical power across 34 metrics calculated using different sampling protocols, types of data, and standardization techniques. We hypothesized that temporal variability would vary by metric class and that metrics with lower temporal variability could achieve greater statistical power for detecting long-term change. To test this hypothesis, we calculated coefficients of variation (CV) for each metric and used linear mixed models to investigate changes in CV for different classes of metrics and standardization techniques. We then conducted a simulation-based power analysis to determine sample sizes needed to achieve 80% power for each metric under variable effect sizes and monitoring scenarios (frequency and configuration of sampling events). We used these simulation results to (i) evaluate trends in statistical power using four different monitoring scenarios and (ii) determine which metrics required the least amount of sampling effort to detect a predetermined effect size.

Methods
Fish community surveys were conducted annually at six study sites located in the Ashokan watershed of the Catskill Mountains, New York, from 2009 through 2018 (Fig. 1). Three sites were located on the main stem of the upper Esopus Creek, and three were located on major tributaries. The sites ranged in drainage area from 10.3 to 165.0 km² and in elevation from 268 to 455 m (Table 1). A single reach ranging from 54 to 100 m in length and encompassing one or two complete geomorphic channel-unit sequences (Fitzpatrick et al. 1998; Meador et al. 2003; Simonson et al. 1994) was sampled annually at each site.
Fish surveys were conducted between late June and early August using multipass depletion electrofishing surveys. Fish were collected from seine-blocked reaches during three consecutive passes using a Smith-Root LR-24 backpack electrofisher and three to five netters. A fourth pass was conducted during three surveys in which the rate of depletion during the first three passes was inadequate to produce reliable population estimates. All fish were identified to species, measured, weighed, and returned to the stream after all passes were completed. In the case of small, highly abundant species, lengths and weights were obtained from a subsample of 30 individuals, after which batches of up to 30 similarly sized fish were processed together using a pooled weight and a single representative length. During each survey, the electrofishing time was recorded, and the reach length and the widths of 10 evenly spaced transects were measured and used to calculate mean reach width and total area sampled. Raw data from the fish community surveys and the dimensions of the surveyed reaches are available in George and Baldigo (2018).

Analysis of variability in fish metrics
The data from electrofishing surveys were used to calculate 34 fish metrics for use in statistical analyses (Table 2). The number and mass of fish captured during each pass were used to estimate abundance and biomass for three groups at each site using the Carle-Strub method (Carle and Strub 1978) with the "FSA" package (Ogle et al. 2018) in R (R Core Team 2019). Estimates were produced for the entire community, all trout species combined, and young-of-the-year trout (hereinafter "all-fish", "trout", and "YOY trout", respectively). The trout and YOY trout groups composited all trout species in the study area, which included brown trout (Salmo trutta), rainbow trout (Oncorhynchus mykiss), and occasionally brook trout (Salvelinus fontinalis). Length cut-offs for designation as YOY were identified using length frequency distributions and were <101 mm for brown trout and brook trout and <91 mm for rainbow trout. The resulting estimates of abundance and biomass for these three groups were standardized by (i) the total area sampled in each survey to produce estimates of density and biomass per unit area and (ii) the sampled reach length to produce estimates of density and biomass per unit of stream length. Additionally, the number and mass of fish captured during the first electrofishing pass of each survey were used to produce "single-pass" density and biomass metrics by reach area, reach length, and electrofishing time for all-fish, trout, and YOY trout. Finally, four diversity metrics (Shannon's index, Simpson's index D (reported as 1 - D), Pielou's evenness, and species richness) were calculated from the first pass of each survey using the "vegan" package (Oksanen et al. 2017) in R (R Core Team 2019).
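These four diversity metrics have simple closed forms. The sketch below (Python rather than the R "vegan" workflow used in the study; the example catch is hypothetical) shows how each is computed from first-pass species counts:

```python
import math

def diversity_metrics(counts):
    """Compute the four diversity metrics used in the study from a
    dict mapping species name to first-pass catch count at one site."""
    n = [c for c in counts.values() if c > 0]
    total = sum(n)
    p = [c / total for c in n]                      # relative abundances
    shannon = -sum(pi * math.log(pi) for pi in p)   # Shannon's index H'
    simpson = 1 - sum(pi ** 2 for pi in p)          # Simpson's index, reported as 1 - D
    richness = len(n)                               # species richness
    # Pielou's evenness J = H' / ln(richness); defined only for >1 species
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return {"shannon": shannon, "simpson": simpson,
            "richness": richness, "evenness": evenness}

# Hypothetical first-pass catch at one site
catch = {"slimy sculpin": 38, "brown trout": 12,
         "rainbow trout": 9, "blacknose dace": 41}
metrics = diversity_metrics(catch)
```

Each metric is computed from relative abundances only, which is why the authors could derive them from single-pass data without a depletion estimate.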
The temporal variability of each metric was expressed among years for each site using the CV. The CV is calculated as √(S²)/X̄, where S² and X̄ are the variance and mean, respectively, of n observations (Power 2007). The CV is frequently used to summarize temporal variability in animal populations because it is unitless and invariant with magnitude (i.e., standardized; Dauwalter et al. 2009). Metrics with a higher mean CV (calculated as the average of the CV from each of the six sites) exhibit greater temporal variability, while those with a lower mean CV exhibit less variability over time. The variability associated with four metric classes was assessed in a linear mixed effects model using CV as the response variable and metric classes as fixed effects to determine which types of metrics were prone to greater temporal variability. The metric classes used as terms in the model were abundance- or mass-based metric, standardization technique (reach area, reach length, or time), species group (all-fish, trout, or YOY trout metric), and number of passes (single-pass or multipass metric). The four diversity metrics were not included in this analysis given their unique nature relative to the other 30 metrics. We included a random effect of "site" on the intercept to account for the repeated sampling of individual sites over time (Bolker et al. 2009). Histograms of the residuals and scatterplots of the fitted values versus the residuals were evaluated to ensure the assumptions of normality and homoscedasticity were met (Zuur et al. 2010). The analysis was conducted using the "nlme" package (Pinheiro et al. 2017) in R (R Core Team 2019) assuming a type I error rate (α) of 0.05.
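As a minimal illustration of the CV summary (a Python sketch rather than the R analysis; the metric values below are made up), the per-site CV and its mean across sites can be computed as:

```python
import math
from statistics import mean, variance  # variance() is the sample variance

def cv(values):
    """Coefficient of variation: sqrt(sample variance) / mean."""
    return math.sqrt(variance(values)) / mean(values)

# Ten years of a hypothetical metric (e.g., trout density) at two sites
site_a = [12.0, 15.0, 9.0, 14.0, 11.0, 16.0, 10.0, 13.0, 12.0, 15.0]
site_b = [40.0, 22.0, 55.0, 31.0, 48.0, 26.0, 60.0, 35.0, 41.0, 29.0]

# The study's "mean CV" for a metric: average of the per-site CVs
mean_cv = mean([cv(site_a), cv(site_b)])
```

Because the CV is unitless, this summary lets metrics measured on very different scales (e.g., fish/100 m versus g/0.1 ha) be compared directly.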

Power analysis for long-term monitoring
We used simulations to determine statistical power to detect changes in the 34 metrics within four monitoring scenarios. The scenarios considered different sampling frequencies at the six sites described previously, while supplementing with unobserved hypothetical sites to allow for up to 120 total sites per year for theoretical purposes. We used linear mixed models to estimate temporal changes in simulated metrics (y) over 10 years (from year zero through year nine for data simulation reasons) as a function of year as a continuous covariate and site (j) as a random effect on the intercept. We used a study period of 10 years to be representative of both the timeframe for ecological change and the period over which studies often occur, although we recognize that the results of the simulation are inevitably related to the duration of the study period chosen, which may be unknown in some cases.
We used four monitoring scenarios to represent different temporal sampling designs in ecological monitoring: annual sampling, biennial sampling, haphazard sampling, and endpoints sampling (Table 3). All scenarios occurred over a 10-year period from years 0 to 9. Each monitoring scenario included a minimum of one survey per site per year of study, and the minimum number of surveys over the study period ranged from 12 surveys for the endpoints scenario to 60 surveys for the annual scenario (the minimum under the biennial scenario was 30, i.e., six sites in each of 5 years; Table 3). For each scenario, we also considered additional samples at unobserved (hypothetical) sites within years of study, ranging from an additional 0 through 114 sites per year for a range of 6-120 total sites per year [6, 7, 8, ..., 30, 45, 60, ..., 120]. While conducting 120 surveys per year may be logistically impractical, this number of surveys provided theoretical upper thresholds for interpreting study results. This resulted in a maximum of 240 surveys for the endpoints scenario and a maximum of 1200 surveys in the annual scenario; these provide conservative upper limits because both are likely well beyond the capacity of most fish monitoring programs.
For each iteration (i) of the simulation, we randomly selected a metric, a monitoring scenario, a number of sites per year, and an effect size (d). Effect size was expressed either as a proportional increase such that d ∈ {1.01, 1.02, ..., 1.10, 1.2, ..., 2.0, 2.5, ..., 5.0} or a decrease (as 1/d). Thus, an effect size of 1.5 represents a 50% increase (d) or a 33% decrease (1/d) in metric value. Both the magnitude and direction (increase or decrease) were chosen at random for each iteration i. We defined the log_e mean at year zero (α, the y intercept) separately for each site (j) based on the site-specific log_e mean (μ_j) for the metric selected using data collected 2009-2018. For unobserved sites, we drew values of α from a normal distribution defined by the pooled log_e means and standard deviations for the selected metric across observed sites. For all years in the simulation study, the mean value of the metric (ŷ_ijt) in each year (t) was the outcome of a linear predictor, ŷ_ijt = α_ij + β_i X_t, where X_t was year from 0 to 9, and β_i was the log_e-scale change over a 10-year period based on d (increase) or 1/d (decrease), divided by the number of survey years (S = 10 for all simulations): β_i = log_e(d)/S for increases and β_i = log_e(1/d)/S for decreases. Importantly, we did not specify a random effect on β_i, or explicitly specify an interaction between α_ij and β_i, which would have allowed metrics to change in different ways between sites across time (site × time interaction). However, researchers are commonly interested in whether changes occur similarly across sites or groups. We note, therefore, that our estimates of power are optimistic if researchers are also interested in site- or group-level random effects on β_i.
Finally, we drew the simulated response y_ijt in each year from a normal distribution with a mean of ŷ_ijt and the site-specific standard deviation σ_ijt for the metric selected using data collected 2009-2018: y_ijt = ŷ_ijt + σ_ijt Z_ijt, where Z_ijt was random error drawn from a standard normal distribution with a mean of zero and a standard deviation of one following Hayes et al. (1995). We assumed that metric CV remained constant over the time period of the study and that σ_ijt scaled linearly with the magnitude of the metric, although alternative error structures are readily implemented through this framework. To maintain this assumption in the generative model, we used the site-specific CV for the selected metric to derive the log_e-scale σ_ijt by rearranging the relationship between log_e-scale CV and variance, CV = √(e^(σ²) − 1), as σ_ijt = √(log_e(CV² + 1)). Simulations were run in parallel using the "snowfall" package (Knaus 2015) in R (R Core Team 2019). We ran the simulation 20 million times to ensure adequate coverage of the parameter values and scenarios used in simulation. We then analyzed the simulated responses using a linear mixed model of the same form as that used for simulation, with a random effect of site on the intercept and a fixed effect of year as a continuous covariate. For each iteration of the simulation, we stored the p value for the fixed effect of year and assessed statistical significance assuming an α of 0.05. If a model failed to converge, or a significant effect was detected in the opposite direction of that specified, we considered these as failures to successfully reject the null hypothesis due to study design and recorded a p value of 1.00. We calculated power for each combination of metric, monitoring scenario, effect size, and sample size as the proportion of simulations resulting in the rejection of the null hypothesis that metrics did not change over time.
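Under the stated assumptions, one site's simulated trajectory can be sketched as follows (a Python stand-in for the authors' R simulation; the function name and parameter values are illustrative, not their code):

```python
import math
import random

def simulate_metric(mu_j, cv_j, d, increase=True, n_years=10, seed=None):
    """Simulate one site's log_e-scale metric trajectory.

    mu_j : site-specific log_e mean at year zero (the intercept alpha)
    cv_j : site-specific coefficient of variation, assumed constant over time
    d    : effect size as a proportional change over the study period
    """
    rng = random.Random(seed)
    S = 10                                            # number of survey years
    beta = math.log(d if increase else 1.0 / d) / S   # log_e-scale change per year
    sigma = math.sqrt(math.log(cv_j ** 2 + 1.0))      # from CV = sqrt(e^sigma2 - 1)
    y = []
    for t in range(n_years):
        y_hat = mu_j + beta * t                       # linear predictor
        z = rng.gauss(0.0, 1.0)                       # standard normal error Z
        y.append(y_hat + sigma * z)                   # simulated log_e response
    return y
```

In the study, each simulated data set was then refit with a linear mixed model of the same form, and the p value for the year effect was recorded to estimate power.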
The performance of the four monitoring scenarios was assessed with two different approaches. First, mean power to detect a change of d = 1.5 was compared between monitoring scenarios across all sample sizes and metrics to understand differences in gross power between scenarios when the number of surveys was not limiting. These results have the potential to be misleading, however, because the power of a given monitoring scenario to detect change is inextricably linked to the minimum number of surveys (N) required for implementation, which varied between scenarios. Therefore, we standardized this comparison by the number of surveys conducted across the 10-year period using the smallest sample size (N = 60) common to all monitoring scenarios to create an index of power to detect d = 1.5 that could be compared between scenarios. This comparison is analogous to a scenario where an agency has sufficient funding for 60 surveys over a 10-year period and seeks to determine which monitoring scenario optimizes the statistical power of those funds.
The performance of individual metrics was assessed within each monitoring scenario by determining the necessary sample size (N) to detect each d. We used the "akima" package (Akima and Gebhardt 2016) in R (R Core Team 2019) to interpolate power across N and d and extracted the values of each that corresponded to a power of 0.80 to construct power curves for each combination of monitoring scenario and metric. A power of 0.80 (an 80% chance, or a probability of 0.80, of detecting a given effect size) was used as a matter of convention (Cohen 1992) and for consistency with similar investigations of power in the fisheries field (Dauwalter et al. 2009; Wagner et al. 2013), but similar information for any level of power is available in the output from the simulation. We then determined the minimum sample size required to detect d = 1.5 (N_d=1.5), d = 2.0 (N_d=2.0), and d = 5.0 (N_d=5.0) for each metric. We used linear regression to relate N_d=1.5 to the metric CV observed in the real-scale empirical data. We log_e-transformed the response to avoid negative predictions and account for heteroscedasticity as log_e(N_d=1.5) = b_0 + b_CV × CV, where b_0 was the intercept and b_CV was the effect of metric-specific CV on the sample size needed to detect a change of d = 1.5 in the corresponding metric.
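The extraction of required sample size from a simulated power surface amounts to inverse interpolation. A minimal linear stand-in for the R "akima" workflow (the power grid below is hypothetical) can be sketched as:

```python
def n_for_power(ns, powers, target=0.80):
    """Linearly interpolate the sample size at which power first reaches
    `target`, given power estimates on an increasing grid of sample sizes.
    Returns None if the target is never reached (cf. the metrics for which
    no N detected the effect)."""
    for (n0, p0), (n1, p1) in zip(zip(ns, powers), zip(ns[1:], powers[1:])):
        if p0 < target <= p1:
            # interpolate between the bracketing grid points
            return n0 + (target - p0) * (n1 - n0) / (p1 - p0)
    # target already met at the smallest grid point, or never met
    return ns[0] if powers and powers[0] >= target else None

# Hypothetical power estimates for one metric under one scenario
ns = [30, 60, 120, 240, 480]
powers = [0.22, 0.41, 0.68, 0.86, 0.97]
n_required = n_for_power(ns, powers)  # falls between N = 120 and N = 240
```

Repeating this across effect sizes d yields the power curves described above; the same grid can be queried at any power level, not just 0.80.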

Results
Results from the 60 surveys indicate that fish communities in the upper Esopus Creek and tributaries were primarily composed of the cottid, salmonid, and cyprinid families, although the relative abundance of each varied by site and year. Species richness ranged from 4 to 11 taxa and slimy sculpin (Uranidea cognata, synonym Cottus cognatus) was the most frequently captured species, comprising 38% of the total catch and 17% of the total mass across all surveys. Brown trout were captured during all 60 surveys, while rainbow trout were captured during 58 of the 60 surveys, and brook trout were captured in 34 of the 60 surveys (George and Baldigo 2018). Trout (all three species composited) comprised 21% of the catch and 44% of the mass across the entire dataset. All-fish density derived from multipass sampling ranged from 495 to 3769 fish/0.1 ha and 244 to 3195 fish/100 m, while comparable values of all-fish biomass ranged from 2102 to 42 472 g/0.1 ha and 1215 to 33 769 g/100 m (Table 2).
A wide range in temporal variability was observed among the 34 metrics. Mean CV ranged from 0.119 for species richness to 1.014 for first-pass YOY trout biomass by reach area (Table 2). In general, diversity-based metrics were the least variable, followed by all-fish metrics, trout metrics, and YOY trout metrics (Fig. 2).
Of the 34 metrics analyzed, the 10 YOY trout metrics exhibited the 10 highest mean and median CV values observed in the study (Table 2; Fig. 2). Similarly, the four diversity-based metrics produced the four lowest mean and median CV values observed in the study. The linear mixed model identified several factors associated with the observed variability in fish metrics. Metric CV varied as a result of species group (F[2,168] = 72.80, p < 0.001) and abundance or mass (F[1,168] = 15.33, p = 0.001) and, to a lesser degree, standardization technique (F[2,168] = 2.06, p = 0.131) and number of passes (F[1,168] = 1.83, p = 0.178). Species group was the most influential factor, and the CV of YOY trout, trout, and all-fish metrics averaged 0.85, 0.60, and 0.44, respectively (Table 4). The mean CV of abundance-based metrics (0.58) was 16% lower than that of mass-based metrics (0.69). Within the standardization technique factor, the CV of length-, time-, and area-based metrics averaged 0.63, 0.59, and 0.68, respectively. The number of passes did not strongly affect CV, although multipass metrics yielded a small reduction in CV.
The power analysis indicated that the monitoring scenario most likely to result in detection of simulated effects was the annual sampling scenario. The annual sampling scenario resulted in a mean power of about 0.59 to detect an effect size (d) of 1.5 across all metrics and sample sizes (Fig. 3). By comparison, the haphazard sampling scenario performed the worst on average and detected an effect of d = 1.5 with a power of 0.45 across all metrics and sample sizes. We used the lowest common N present in all four scenarios (60 surveys) to standardize the mean power (across all metrics) to detect a change of d = 1.5 as an index of statistical power that could be used to assess the efficiency of each monitoring scenario. When sample size was standardized this way, the endpoints scenario resulted in the greatest mean power at N = 60 (0.53) and achieved notable separation from the other three scenarios, which ranged in mean power from 0.35 to 0.37 (Fig. 3). We focus the reporting of the remaining analyses on the biennial scenario for simplicity and due to the current usage of this design in other trout monitoring programs (Eaglin et al. 2007), although the same analyses were run for all monitoring scenarios with similar patterns.
An analysis of the performance of individual metrics within the biennial scenario indicated large differences between metrics in the number of surveys needed to detect predetermined effect sizes at a fixed level of power. We were unable to detect a change of d = 1.5, 2.0, or 5.0 with power of 0.80 using the 10 YOY trout metrics with up to 600 surveys, the maximum considered in this scenario (Table 2). Of the remaining 24 metrics, the number of surveys required to detect d = 1.5 ranged from N = 30 (the minimum considered in the biennial scenario) for the four diversity metrics to N = 393 for the trout biomass by length metric. Using d = 2.0, three additional metrics calculated from all-fish density data were able to achieve a power of 0.80 with N = 30 (Table 2). All metrics were able to detect a change of d = 5.0 with a power of 0.80 and N = 30 with the exception of the YOY trout metrics. The relationship between variability and power was explored within the biennial scenario using simple linear regression of the mean CV (average of the CV from each of the six observed sites) and the number of surveys (N) needed to detect a change of d = 1.5 with a power of 0.80 for each metric. Only 24 metrics were included in this analysis because the 10 YOY trout metrics failed to detect a change of d = 1.5 with any N considered. We found a significant relationship (R² = 0.93, t[22] = 16.82, p < 0.01) between mean CV and N (Fig. 4). Metrics with mean CV < 0.3 required ≤42 surveys to achieve the desired power, metrics with CV = 0.3-0.7 required 82-393 surveys, and metrics with CV > 0.7 failed to detect a change of d = 1.5.

Fig. 2. Boxplots showing the coefficients of variation (CV) from all six sites for each of 34 fish metrics. Metrics are shaded by the species group factor such that "all-fish" metrics are white, "trout" metrics are light gray, and "YOY trout" (i.e., young-of-the-year) metrics are dark gray.

Discussion
We identified large differences in the temporal variability of different classes of metrics used to monitor fish assemblages and a strong positive relationship between metric variability and number of surveys required to achieve desired statistical power. These findings supported our hypothesis and the findings of others (Ham and Pearsons 2000; Wagner et al. 2007) that temporal variability can obscure the detection of long-term changes in fish metrics. The sampling design in which metrics were utilized (e.g., annual or less frequent sampling) also affected statistical power. More importantly, individual metrics varied greatly in the sample size required to detect fixed levels of change with a power of 0.80. This suggests that some metrics have little practical value in long-term monitoring given their inability to detect anything less than a catastrophic change in resource condition and the immense sample size necessary to do so. Together, these results indicate that large savings in monitoring effort and resource expenditure can be obtained during the response design by utilizing biologically representative metrics that are robust to temporal noise within the most appropriate sampling design.
Among the four sampling designs, the endpoints scenario produced the greatest mean power at a fixed number of surveys and therefore represented the most cost-effective monitoring scenario. This is an experimental design principle that is well established under names such as oversampling and extreme group analysis in a variety of fields (Preacher et al. 2005; Vaughan 2017). The mean power across all metrics to detect d = 1.5 with N = 60 was 0.53 in the endpoints scenario, far greater than that of the other three scenarios, in which mean power ranged from 0.35 to 0.37. Although the endpoints scenario was the most cost-effective approach in the present study, there are many situations in which this sampling scenario would not provide adequate contrast or capture change over a temporally appropriate scale, both common flaws in the design of ecological monitoring programs (Lindenmayer and Likens 2018). Furthermore, the "ends" of a monitoring period are rarely known or may be indeterminate, so this design may not be realistic or practical to implement for trend detection. When differences in sample size (total number of surveys) were not considered, annual sampling resulted in the greatest statistical power to detect change. This result was anticipated, and it is well established that increased frequency of sampling results in increased statistical power to detect effects of interest for ecological parameters characterized by high degrees of variation (e.g., Schweiger et al. 2016). In general, for highly variable metrics or in cases where minimization of type II error rates is paramount (e.g., endangered species monitoring), annual sampling represents the "gold standard" for ecological monitoring. Similarly, annual or biennial monitoring may be a requisite component of adaptive resource management plans at localized scales (Biber 2011).
However, where landscape-scale changes are of interest (e.g., large number of sites) or resources are otherwise limiting, the amount of effort required of the annual monitoring scenario may be cost-prohibitive. As a result, we focused the reporting of individual metric performance within the biennial scenario because it represents a compromise between the annual and endpoints scenarios, is currently used in trout population monitoring (Eaglin et al. 2007), and achieved power similar to the annual monitoring design at N = 60.
The temporal variability in YOY trout metrics was so large that the utility of this metric class for detecting change over time may be minimal. The YOY trout metrics were included in this analysis because protecting and promoting trout recruitment is an important goal of state and local stream managers (CCE 2007b). These metrics are surrogates for spawning success and may indicate future cohort strength. This life stage is also one of the most sensitive to acidification (Baldigo and Lawrence 2001; Simonin et al. 2005), making it a valuable indicator of water and habitat quality. The mean CV of YOY trout metrics was 41% and 93% higher than that of trout and all-fish metrics, respectively, and the 10 YOY metrics produced the 10 highest mean and median CVs of all 34 metrics. This finding is consistent with those of Dauwalter et al. (2009), who showed that temporal variability of a single trout age class was greater than that of multiple age classes combined. Dauwalter et al. (2009) were not able to include YOY (age-0) fish in their assessment, however, and therefore stated they could not determine what effect their inclusion might have on the variability of abundance or biomass metrics. Our results address this question and clearly indicate that the inclusion of YOY in trout metrics increases temporal variability and reduces statistical power. This finding is not unexpected because trout recruitment varies greatly among years in the Esopus Creek (George and Baldigo 2016; George et al. 2015) and elsewhere (Cattanéo et al. 2003; Unfer et al. 2011). In our simulation, YOY trout metrics could not achieve a power of 0.80 to detect a change of d = 1.5, 2.0, or 5.0.
This is problematic both because the number of surveys needed to detect change with these metrics appears to be prohibitively large and because any effect sizes that could be detected would likely far exceed those targeted by most monitoring programs, putting ecologically meaningful changes to fish populations at risk of going undetected.
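The cost of metric variability can be sketched with a standard normal-approximation sample-size formula for a two-sample comparison. This is a textbook approximation, not the paper's simulation, and the proportional change and CV values below are illustrative only.

```python
from scipy.stats import norm

def n_per_group(delta, cv, alpha=0.05, power=0.80):
    """Surveys per group needed to detect a proportional change `delta`
    in a metric with coefficient of variation `cv`. The standardized
    effect is d = delta / cv, so required N grows with the square of cv."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z * cv / delta) ** 2

# Illustrative CVs spanning community-like to YOY-like variability
for cv in (0.3, 0.6, 1.2):
    print(f"CV = {cv:.1f}: ~{n_per_group(0.5, cv):.0f} surveys per group")
```

Because N scales with CV squared, doubling a metric's CV quadruples the surveys required, which is consistent with highly variable YOY metrics either demanding prohibitively many surveys or failing to reach a power of 0.80 at any N considered.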
Fig. 4. Relationship between the mean coefficient of variation (CV) and number of surveys (N) needed to detect a change of d = 1.5 with a power of 0.80 for 24 fish metrics using the biennial sampling scenario. Black dots are raw data, the black line is the mean predicted number of surveys required, and the gray polygon indicates the 95% prediction interval. Ten metrics were excluded from the regression because the specified effect size could not be detected in those metrics at any N considered.

The strong performance of time-based metrics was an unexpected finding in this assessment and may reflect gear saturation. Time-based metrics had lower temporal variability than area- or length-based metrics, and the "all-fish first-pass density time" metric required the fewest surveys to achieve adequate power of any metric except the four diversity-based metrics. These findings are consistent with those of a similar study in headwater streams, which found that time-based metrics were less variable than metrics standardized by area or length and generally required similar or smaller sample sizes than other abundance-based metrics to achieve adequate power. Metric power and metric accuracy are not necessarily synonymous, however, and a metric could conceivably achieve low temporal variability, and therefore high power, because of a consistent bias (e.g., underestimation of extreme values). The data in this study, as well as those in George et al. (2019), suggest that time-based metrics were strongly correlated with area- and length-based metrics at low-to-moderate fish densities, but that time-based metrics plateaued, and this relationship was not maintained, at higher fish densities. This pattern likely reflects gear saturation at higher densities, at which point capture efficiency (the probability of capturing an individual fish in a unit of effort) diminishes. In an examination of boat electrofishing data, Marcy-Quay et al. (2019) showed that when fish densities were high, a time-based measure of effort resulted in hyperstable catch per unit effort (relative to spatial standardizations) because of gear saturation. Although not as well documented for backpack electrofishing, gear saturation could similarly result in greater temporal variability of area- and length-based metrics if they are better able than time-based metrics to document the upper maxima of the natural range in abundance. Therefore, the lower variability and greater power observed in time-based metrics should be interpreted cautiously and may indicate that this metric class is not ideal for trend detection.
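A toy saturating-catch model illustrates how gear saturation could compress the apparent variability of time-based metrics. The Michaelis-Menten catch function and all parameter values here are hypothetical, not estimated from the study's data.

```python
import numpy as np

rng = np.random.default_rng(7)

def catch_per_time(density, q=1.0, half_sat=50.0):
    """Saturating catch per unit time: handling time caps the catch
    rate as true density rises (gear saturation)."""
    return q * density / (1.0 + density / half_sat)

# 200 simulated "years" of true density; area-based catch is assumed
# proportional to density, so it inherits the full variability.
true_density = rng.lognormal(mean=3.5, sigma=0.8, size=200)
cpue_area = 0.6 * true_density
cpue_time = catch_per_time(true_density)

cv = lambda x: x.std(ddof=1) / x.mean()
print(f"CV of true density:      {cv(true_density):.2f}")
print(f"CV of area-based metric: {cv(cpue_area):.2f}")  # tracks density
print(f"CV of time-based metric: {cv(cpue_time):.2f}")  # compressed
```

The compressed CV of the time-based metric is the hyperstability described by Marcy-Quay et al. (2019): the lower variability reflects the gear's failure to register high densities, not a genuinely more stable population.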
Our study found significantly lower variability in abundance-based metrics relative to mass-based metrics, but the results of other studies suggest this finding may vary by stream size or other habitat factors. For example, Dauwalter et al. (2009) did not find a clear pattern in the relative variability in abundance and biomass of populations of different trout species from streams across North America. Similarly, a study of headwater streams in New York found that most mass-based metrics had slightly lower variability than their abundance-based counterparts, but the difference was not statistically significant. One source of this inconsistency may be differences in the size of the streams considered: the average drainage area of the 13 headwater study streams was 9.5 km², compared with 67.2 km² in this investigation. In the present study in the Ashokan watershed, it was common to encounter one or two large individual fish that composed a large portion of the entire biomass at a location. For example, a brown trout with a mass of 1100 g captured during the 2010 survey at site esop3a comprised 45% of the total mass collected in that survey. Thus, on the Esopus Creek and its tributaries, the presence or absence of one or two large fish (which may move in or out of the study reach at random) on a given day can create high temporal variability in mass-based metrics over time. This source of variability could potentially be reduced by sampling longer reaches, such as the minimum 150 m recommended by the US Environmental Protection Agency (USEPA 2017), but conducting multiple passes on reaches of this length can be cost- and time-prohibitive. Overall, however, it appears that the variability and power of abundance- and mass-based metrics will vary by study area and cannot be broadly generalized.
Although our simulation study addresses a number of important and understudied aspects of stream fish assemblage monitoring, the findings should be viewed within the broader context of environmental monitoring. First, the metrics we used in the simulation were chosen because they are regionally appropriate for the low-diversity streams in the Catskill Mountains. Many of these metrics have been used extensively in New York to evaluate the impacts of, and recovery from, acid deposition (Baldigo and Lawrence 2001; Simonin et al. 2005). Researchers working in other regions will likely want to consider a different suite of metrics that effectively characterize the condition of local assemblages. For example, more diverse assemblages with species spanning a wider range of life histories and trophic positions might be more effectively characterized using functional metrics within an index of biological integrity (Karr 1981). However, the assessment of metric power we present here can be used as a framework to evaluate the performance of any suite of metrics to inform monitoring in a wide range of conditions. Second, and more importantly, low temporal variability in a metric does not imply that the metric adequately characterizes the resource of interest. Maximizing power to detect change in long-term monitoring is only valuable if the metrics considered provide relevant biological information over an applicable timeframe. In our study, the most general metrics (aggregated diversity and all-fish metrics) had the smallest CVs and highest power to detect change, whereas biologically specific metrics (trout and YOY trout metrics) had the largest CVs and lowest power to detect change. This suggests there may be a trade-off between the ability to detect change and the biological specificity or relevance of a metric.
Obviously, a researcher interested in determining whether trout populations are declining over time would not use a metric of the entire fish community simply because it was less variable than a trout-only metric. However, using a YOY metric in the streams studied here would lead to consistent failure to detect change at the assumed type I error rate (α = 0.05), and detection would become even more difficult under more complex experimental designs, such as those incorporating site × time interactions. Balancing the optimization of power against the biological relevance of candidate metrics is, therefore, an important consideration during the response design.
The findings from this study address a number of understudied topics in response design (Stevens and Urquhart 2000) and have important implications for optimizing fish monitoring efforts in streams of the Catskill Mountains region and elsewhere. First, our findings generally supported those of others that there is often low power to detect trends in fish populations (Ham and Pearsons 2000; Wagner et al. 2013). Thus, natural resource managers should consider how critical it is to maintain a type I error rate of 0.05, or whether achieving 80% or 90% confidence in a trend is sufficient to warrant management action (Dauwalter et al. 2009). However, our simulation found a large gradient in the statistical power that can be obtained from different fish metrics and, to a lesser extent, the monitoring framework in which they are utilized. Within the biennial scenario, the number of surveys needed to detect a change of d = 1.5 with a power of 0.80 ranged from 30 to greater than 300 depending on the metric utilized. Given that the cost of a typical fish community survey may range from US$500 to US$3000 (Baldigo et al. 2017), the difference in resource expenditure needed to reach a power of 0.80 could vary by hundreds of thousands of dollars between metrics. Similarly, the standardized (N = 60) index of mean power across metrics ranged from 0.35 to 0.53 depending on the monitoring scenario utilized, suggesting that the monitoring framework could affect resource expenditure on the scale of thousands to tens of thousands of dollars. This suggests that the decision of what to monitor (i.e., metric selection) may affect statistical power more strongly than the monitoring framework within which that metric is utilized, although both require careful consideration. Additionally, our simulation results indicate that metric power ranged from high to low across a gradient of broad to narrow biological specificity.
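The resource implications can be made concrete with back-of-envelope arithmetic using the survey counts and per-survey costs cited above (the pairings of metric and per-survey cost are illustrative extremes, not budget estimates):

```python
# Survey counts from the biennial scenario (30 vs. >300 to reach a
# power of 0.80) and the per-survey cost range from Baldigo et al. 2017.
n_best, n_worst = 30, 300      # surveys needed: robust vs. noisy metric
cost_lo, cost_hi = 500, 3000   # US$ per fish community survey

print(f"Robust metric, cheap surveys:  ${n_best * cost_lo:,}")
print(f"Noisy metric, costly surveys:  ${n_worst * cost_hi:,}")
print(f"Spread from metric choice alone at ${cost_hi}/survey: "
      f"${(n_worst - n_best) * cost_hi:,}")
```

Even at the high per-survey cost, metric choice alone separates the budgets by hundreds of thousands of dollars, dwarfing the savings attributable to the choice of monitoring framework.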
As a result, diversity- and community-based metrics achieved greater power than metrics summarizing only a subset of species or life stages. Thus, developing an effective monitoring plan is a complex process that should consider first the question of interest, then the biological relevance of candidate metrics, and finally the statistical power of metrics and monitoring designs.