Bridging mechanistic conceptual models and statistical species distribution models of riverine fish

Statistical species distribution models (SDMs) are widely used to quantify how taxa respond to environmental conditions and to predict their distribution. However, the application of SDMs to freshwater fish taxa is complicated by the active dispersal of fish taxa through river networks, and the speciesand habitat-dependent observation process (i.e., the sampling method and effort) required to accurately sample their distributions. Many studies have applied presence-absence models (PAMs) to fish taxa, while more recent studies have proposed zero-inflated models (ZIMs) to account for count observations with many zeroes. However, relatively few studies have incorporated the observation process into the model structure, which would facilitate the combination of data from various monitoring programs that differ in their observation process. In this study, we use conceptual models to identify potentially dominant natural and anthropogenic environmental conditions with a direct, mechanistic effect on the distributions of freshwater fish taxa in Switzerland, a region with a large range of environmental conditions, from alpine streams that are mainly affected by hydromorphological alterations to lowland streams in densely populated areas with intensive agricultural land use. Moreover, numerous barriers impede fish migration along the entire river network. Using combined data from two fish monitoring programs in Switzerland, we applied an exhaustive cross-validation procedure to select a set of environmental variables with the highest (out-of-sample) predictive performance for the PAM and ZIM for fish density (individuals/m2) of the seven most prevalent fish taxa (Salmo spp., Cottus spp., Squalius spp., Barbatula spp., Barbus spp., Phoxinus spp., Gobio spp.). We used these variables to develop a PAM and ZIM for each taxon that accounts for differences in sampling methods and sampling effort. We quantified the quality of fit during calibration using all samples and predictive performance during 5-fold cross-validation of each model. Results show that stream temperature and stream morphology within the accessible habitat commonly appear among the best predictive presence-absence models for multiple taxa. Spatial variables that account for migration barriers and quantify morphological conditions within the accessible habitat were selected for 6 out of 7 taxa. The selected PAMs performed well for all taxa with an intermediate prevalence (10–40%), with an explanatory power (D2) of between 0.32 0.37 during calibration using all samples and only minor decreases in explanatory power during cross-validation (D2= 0.34 – 0.44). As expected, the PAM for the highly prevalent Salmo spp. (91%) failed to predict the few absence data points. By contrast, the ZIM model performed best for Salmo spp., with a standardized likelihood ratio of 1.56. For all other taxa besides Barbus spp. the ZIM models also had likelihood ratios above one, indicating a better predictive performance than the null model. We hope this study stimulates the development and application of fish species distribution models based on prior knowledge of causally linked environmental variables and incorporating observation errors to improve their predictive performance. This can facilitate learning from biomonitoring data to support management.


Introduction
Freshwater ecosystems such as lakes and rivers are rich in biodiversity, however over-exploitation places freshwater ecosystems at greater risk of habitat destruction and degradation than their terrestrial and marine counterparts (Dudgeon et al., 2006;Vörösmarty et al., 2010). Within the European Union, nearly half (47%) of lakes and rivers failed to achieve a good ecological status in 2015 (as defined by the EU Water Framework Directive using indicators of the quality of biological communities), with many freshwater species either increasingly threatened or endangered, including mammals, birds, fish, insects and other invertebrate communities.
Fish species in Europe are of particular concern, with 37% of the 547 native species listed as threatened or endangered, and 17% of species populations in decline (Brooks and Freyhof, 2011). The main anthropogenic threats to Europe's freshwater fish species are the destruction, degradation, and fragmentation of habitat due to the channelization of natural river courses, pollution from intensive agriculture and urban areas, the construction of dams, and water abstraction (Gozlan et al., 2019). Studies have emphasized the need for additional biomonitoring efforts to more accurately characterize the geographic distribution and population trends of freshwater fish species to (1) improve future risk assessments, (2) identify underlying environmental drivers of species' distributions, and (3) better inform stream management (Gozlan et al., 2019;Vörösmarty et al., 2010).
Accurately assessing the total population of a fish species in a given river length can be difficult due to the active dispersal capabilities of fish (e.g., relative to benthic macroinvertebrates) and the observation error inherent to imperfect fish sampling methods. Electrofishing is perhaps the most common fishing method for stream biomonitoring, and involves passing a high-voltage current through a length of river to stun fish for easy collection and analysis. Additional sampling efforts may increase the proportion of fish caught (i.e., reducing observation error), including the placement of nets at the start and end points of the fished area to prevent individual fish from escaping prior to sampling, and performing multiple electrofishing "passes" or "rounds" to accurately quantify the number of individuals in a reach. Multiple fishing rounds can provide particularly useful data for fisheries management, including population estimates that can be derived based on the number of fish captured in successive fishing rounds and on prior knowledge of the probability of capturing an individual of a given species (e.g., Carle and Strub, 1978).
In obtaining presence-absence and abundance (i.e., count) observations from biomonitoring data, studies over the past two decades have applied statistical species distribution models (SDMs) to quantify how observed distributions of species respond to a range of natural and anthropogenic environmental conditions. Techniques in statistical and machine learning have been applied, ranging from classical statistical models such as generalized linear models (GLMs) and generalized additive models (GAMs) (e.g., Fukushima et al., 2007;Olden and Jackson, 2002) to machine learning algorithms such as random forests and boosted regression trees (Chee and Elith, 2012;Maloney et al., 2013).
Environmental conditions used as explanatory variables in fish SDMs include topography (e.g., elevation; Bond et al., 2011), soil and geology (e.g., Maloney et al., 2012), climatic variability (e.g., temperature, precipitation; Creque et al., 2005;McNyset, 2005), hydrological regime (e.g., flow velocity, discharge; Bond et al., 2011;McNyset, 2005), and habitat quality (e.g., substrates, hiding spots; Creque et al., 2005). Anthropogenic impacts due to land use (e.g., agriculture, urban areas), impaired river morphology, and (to a lesser extent) habitat fragmentation due to barriers (e.g., distance to barriers such as dams) (Radinger et al., 2017(Radinger et al., , 2019Rolls et al., 2014) have been incorporated into fish SDMs to quantify the effect of explanatory variables at multiple spatial scales (Chee and Elith, 2012;Peterson et al., 2011). The importance of including environmental conditions that have a direct, mechanistic effect on species distributions has been emphasized as a means to improve model interpretability and to improve understanding of the mechanisms underlying species distributions (Austin, 2002). This principle is difficult to implement in practice due to limited data availability, but is especially important when the models are intended to inform stream management.
Moreover, the presence-absence and abundance (i.e., count) observations of fish species distributions often exhibit statistical properties that pose additional challenges when applying classical statistical SDMs. Studies applying SDMs in wider ecological contexts often assume species presence-absence or count observations to follow a binomial or Poisson distribution, respectively. However, count distributions of fish may exhibit overdispersion (i.e., the variance is greater than the mean) due to a patchy distribution of individuals. The Poisson distribution assumes that its mean and variance are equal, which yields biased parameter estimates when used to model overdispersed counts (Martin et al., 2005;Zuur et al., 2009). Typically, the negative binomial distribution is used to account for overdispersion (e.g., Warton, 2005), but may still be inadequate for explaining overdispersed counts with excess zeroes (Potts and Elith, 2006).
To model counts of species exhibiting overdispersion and many zeroes, zero-inflated models 1 (ZIMs) have been proposed that effectively combine a binary outcome model (i.e., with a Bernoulli distribution) that predicts excess zeroes with a model for count data (i.e., following a Poisson or negative binomial distribution) (Martin et al., 2005;Wenger and Freeman, 2008;Zuur et al., 2009). ZIMs have been applied to marine fish species for some time (e.g., Stefánsson, 1996), often with the aim of better informing fish stock assessments and fisheries management (Cosandey-Godin et al., 2014;Thorson et al., 2016;Walsh and Brodziak, 2015). More recently, ZIMs have been applied to freshwater fish, with initial studies comparing the performance of classical statistical techniques with their zero-inflated counterparts (Lewin et al., 2010;Vaudor et al., 2011). Increasingly sophisticated applications of ZIMs have been proposed to model freshwater fish distributions, including models that include species observations with multiple size classes (Kanno et al., 2012), account for non-linear responses of species to environmental conditions , and quantify environmental conditions at multiple spatial scales (Stewart-Koster et al., 2013). Hierarchical model structures have also been proposed to account for spatial autocorrelation among environmental conditions due to the hierarchical structure of the river network (Boone et al., 2012). However, while studies applying ZIMs to marine fish observations have incorporated the observation process (i.e., the equipment used and sampling effort) into their proposed models, there are relatively few studies that propose similar models for freshwater species .
In this study, our overarching aim is to develop conceptual models of fish autecology and use them to develop statistical models of fish species distributions (i.e., fish SDMs). More concretely, we develop conceptual models to summarize our prior knowledge of dominant natural and anthropogenic environmental conditions that have a direct, mechanistic effect on freshwater fish species. We then propose two distinct statistical models to predict the occurrence of different fish taxa, namely a presence-absence model (PAM), and a zero-inflated model (ZIM) to predict fish density (i.e., fish count per unit area fished), respectively. Although we use the conceptual models to identify environmental conditions that can potentially be included as explanatory variables in the statistical models, we ultimately select explanatory variables among these that provide the best predictive performance for each species. In pursuing the overarching aim of this study, we address the following research questions: 1 To what extent can we predict the occurrence and density of freshwater fish in a region with a large range of environmental conditions, from densely populated areas with intensive agriculture to alpine regions (based on currently available data from different monitoring programs and based on different fish sampling methods)? 2 Which of the available environmental variables are most important for predicting occurrence and density of the most common fish taxa and do they reflect prior knowledge?
In developing the PAMs and ZIMs for multiple fish taxa, we combine several innovations. We include the observation process in the structure of the statistical models to combine observational data collected using different sampling methods (i.e., with qualitative, semi-quantitative, and quantitative fishing). In classical approaches (e.g., Carle and Strub, 1978), the "true" density of fish is estimated from repeated samplings. These estimates are then used to calibrate a model for the "true" density. In our approach, the measurement process is included in the model, which predicts the "true" density as internal state and provides the estimated fish caught as additional model output. This makes it possible to consider all sources of uncertainty during the calibration 1 Although early ecological applications of zero-inflated models are highlighted, zero-inflated distributions were first proposed by Aitchison (1955) and later applied by Lambert (1992). process. The same applies for occurrence in the PAM. In addition, we quantify habitat fragmentation (due to natural and anthropogenic barriers in the river network) and influence factors that act at different spatial scales (e.g., reach and accessible area). With this study, we explore what we can learn from model-based analysis of currently available biomonitoring data and identify limitations and gaps to support river management.

Observational data
We combined data from the Swiss National Surface Water Quality Monitoring Program (NAWA) (Kunz et al., 2016) and the Progetto Fiumi program (Brodersen and Seehausen, 2014) (Table 1). Both programs aim to describe fish community structure in Swiss streams and rivers through quantitative, semi-quantitative or qualitative electrofishing. Sites in the Progetto Fiumi program were chosen to include all Swiss drainages and the entire altitudinal gradient, which are habituated by fish (203-2297 m.a.s.l.). During quantitative sampling, block nets were installed at the start and end points of the fished reach and two or three rounds of fishing were done, with all fish caught included in the sample data and thus allowing for population estimates of individual taxa. Semi-quantitative fishing was similar to quantitative fishing efforts, but without the use of block nets and a single round of fishing. During qualitative sampling, one fishing round was done along a reach without the use of nets. In qualitative sampling, the reach was not entirely fished and not all captured fish were included in the data, but all captured taxa were recorded.
In total, 55 species were recorded. As the taxonomic status of several of the most important riverine fish in Central Europe is currently being revised (e.g., Kottelat and Freyhof, 2007;Lucek et al., 2018;Palandačić et al., 2017) and species-specific field records are therefore often unreliable, we aggregated all species at the genus level (see SI section 1.1 for further explanations).
Based on the combined Progetto Fiumi and NAWA data, we derived the abundance (i.e., count) and presence-absence observations of fish taxa in each sample. Due to the difficulty of modeling rare taxa (Guisan et al., 2006;Potts and Elith, 2006;Sor et al., 2017), we selected seven taxa that occur in 10% or more of all samples (Table 2) for model development.

Conceptual model
Based on literature sources and consultations with biologists on the autecology of the selected fish taxa, we developed conceptual models that show the current knowledge about dominant natural and anthropogenic environmental conditions that drive the distribution of fish taxa throughout their major life stages (see Fig. 1 for Salmo spp. and SI section 1.4 for similar conceptual models for additional taxa). The main goal of these conceptual models is to inspire the development of statistical models based on causally linked explanatory variables to the degree possible (Schuwirth et al., 2019). We included environmental conditions and processes in the conceptual models regardless of their data availability (e.g., fish stocking, impacts of angling on fish populations, predation), while acknowledging that numerous additional factors for which there is little or no available data could also be included (e.g., prevalence of parasites and proliferative kidney disease, effects of hydropeaking). However, the lack of data for specific environmental conditions (e.g., water quality variables such as fine sediment loading) does not exclude the possibility of indirectly quantifying their effect on the distribution of fish taxa (e.g., by including agricultural land use indicators as explanatory variables in our statistical models).

Model definition
In this section, we introduce two statistical models: first, a presence/ absence model that takes observation errors into account, and second a model that predicts the fish density (i.e., the number of individuals per square meter) at a site but is calibrated on the observed counts (i.e., abundance) of a given fish taxon. Both models include an observation process (based on the fishing method) in the model structure. For all model definitions the following indices are used: The time point of sampling t i at a site is needed to represent the 11% of sites that are repeatedly fished. However, for simplicity we omit this index when defining the presence-absence and zero-inflated models.

Presence-absence model
The PAM consists of two parts: a generalized linear model that predicts the probability of occurrence of a given fish taxon based on environmental conditions, and an observation model that describes the probability of observing the taxon based on its probability of occurrence, the number of fishing rounds, and the probability of catching and correctly identifying an individual that is present. The model structure is visualized as a network in Fig. 2.
The presence or absence of a fish taxon at site i is encoded by the variable Y i , which equals one if the taxon is present or zero if the taxon is absent. We describe the presence-absence observations with a conventional logistic regression where the probability of occurrence π i is given by with the intercept α, selected environmental conditions as explanatory variables x ik , and coefficients β k that quantify the taxon-specific responses to the environmental conditions. It is a common approach to derive observations of Y i directly from count data. However, counting fish is prone to errors due to the incomplete sampling of all individuals present and potential misidenti- Note: The number of sites and samples included in the model for a taxon depends on the availability of environmental data during model selection, and in turn affects the prevalence of a taxon in the datasets.   fication of taxa. To represent this in the model, we make a distinction between the presence-absence observations Y obs i which are based on the count data and the "true" presence-absence of a taxon Y i . An observation model links the two variables: The probability to catch at least one individual at site i is represented by the true positive probability p true,i . This probability is derived from a taxon-specific probability p catch of catching an individual and the number of fishing rounds n rounds i : This derivation requires the distribution of the number of fish P(N), which was approximated with the empirical distribution of the count data across all sites.
There is also a (typically low) probability that a taxon is erroneously observed due to misidentification. This is modeled by the false positive probability p false . The two probabilities p catch and p false are based on expert judgment and not inferred from the data.

Fish density model
The ZIM predicts the number of observed fish based on a zeroinflated distribution for the fish density and an observation model that accounts for the fished area, the number of fishing rounds, and the catch probability (Fig. 3).
For a given taxon, the fish density ρ i (i.e., the number of fish per square meter) at a site i is modeled by two components: i) a linear combination of the environmental explanatory variables x count ik with parameters α count and β count k (i.e., the count component; Zeileis et al., 2008) and ii) a normally distributed, site-specific random effect ∈ i . This results in: The site effect ∈ i can be interpreted as a residual term, and would be included to quantify the intrinsic uncertainty of the fish density at each site due to environmental conditions not included in the model or due to biotic interactions. Despite this appealing interpretation, we eventually decided to omit the site effects because of the difficulty of defining appropriate informative priors, which would be needed to avoid identifiability problems during parameter estimation (see Discussion).

Including zero-inflation
With the exception of Salmo spp., other taxa include observations with a large proportion of zero counts (SI Fig. 1) -more than what can be explained by the selected environmental variables for fish density. These excess zeros can arise if the taxon is absent at the time of sampling a site or the limited area fished relative to the spatial and temporal scale of the species movements (Martin et al., 2005). Alternatively, the habitat conditions at a site may be suitable for the ecological preferences of a taxon but is otherwise inaccessible due to habitat fragmentation, including physical barriers within the river network (e.g., dams or weirs). To model these excess zero densities, a Bernoulli distributed random variable is introduced If π zero i equals one, the zero-inflated fish density ρ ′ i at a site is zero: This zero component is similar to a logistic regression applied to presence-absence observations, i.e. a generalized linear model with a logistic link function. Note that although we distinguish between environmental variables for the count and zero components in our model structure, an explanatory variable may be included in both components.

The observation process
To link the fish density with the observations, the model includes an observation process. The mean number of fish N i that can be potentially caught at a site is the product of the zero-inflated fish density ρ ′ i and the fished area A i (m 2 ): However, the number of fish that are actually caught is influenced by the number of sampling rounds n rounds i and the taxon-specific probability p catch of catching an individual. The expected total number of individuals to catch in a sample is calculated as The randomness of the catch procedure is often modeled with a Poisson or negative binomial distribution. Because we cannot exclude the possibility of overdispersion (Martin et al., 2005;Zuur et al., 2009), the observations are assumed to follow a negative binomial distribution

Model performance
We evaluated the performance of the presence-absence and zeroinflated model for each taxon by quantifying (a) the quality of fit of each model during calibration using all samples in the data and (b) the predictive performance of each model during k-fold cross-validation. The quality of fit and predictive performance of the models is quantified with the metrics defined below.

Quality of fit
Assuming independent observations, the likelihood of the presenceabsence model is and the likelihood of the zero-inflated model where Θ represents all ZIM parameters (α zero , β zero ,α count ,β count , p catch ,ϑ). Because the number of observations I may vary across taxa, we standardize the likelihoods The quality of fit of a model is quantified by the ratio of the standardized likelihood of the proposed model (i.e., including the environmental variables and their respective parameters) and the likelihood of the simpler null model (i.e., a model including only the intercepts α zero and α count ): This quantity expresses how much more likely an observation is on average based on the proposed model compared to the null model. Alternatively, the quality of fit of the PAM for each taxon can be quantified using the standardized deviance of the model predictions from the presence-absence observations The explanatory power of the environmental variables selected as model inputs in the presence-absence model for a taxon can be assessed using the D 2 statistic (similar in interpretation to the R 2 of a linear model assuming a normally distributed response; see Guisan and Zimmermann, 2000). The D 2 is calculated as the fraction of null model (i.e., a model with only an intercept α) deviance that is reduced by the proposed model deviance: We did not use the deviance or D 2 for the ZIMs, because it is not clear how to specify the saturated model (Millar, 2011). Instead, we quantify the quality of fit of the ZIMs using the log-likelihood (logL) of the proposed model and the likelihood ratios L ratio .

Predictive performance
We quantified the predictive performance of the PAM and ZIM for each taxon by k-fold cross-validation. In k-fold cross-validation, we randomly partition the full dataset into k subsets, calibrate the proposed and null PAM or ZIM to every combination of k − 1 folds, and obtain model predictions from the independent subsample k. The predictive performance of the model over the independent subsamples is then quantified by the total log-likelihood (of the k testing data sets) divided by the number of observations. We chose k = 5 to obtain sufficiently large sample sizes during prediction on the independent data (not used to calibrate the model) that ensure more robust estimates of predictive performance.
In addition, for the PAMs we calculated the average explanatory power (D 2 ) over the k folds during calibration and prediction on independent data (in calculating the mean explanatory power during prediction, the null model is calibrated using the independent subsample). The performance of the ZIMs during cross-validation was furthermore quantified with the standardized likelihood ratio L ratio for testing and training data over all folds.

Preparation of potential variables
Based on the mechanistic conceptual models we developed for each fish taxon, we selected potential variables for which data was available or which we could indirectly derive from existing data. An inspection of the Pearson correlations (r) between the potential variables revealed no strong correlations (i.e., |r| < 0.7; see section 1.2 in the SI for the pairwise correlations among potential variable for all samples), with the exception of the stream width and depth variability below (see Table 3).
As potential explanatory variables, we included the habitat conditions (e.g., substrate, flow velocity) at each site (Table 3). In addition, we performed extensive spatial analyses of the river network to quantify the accessible habitat area available from each site at multiple spatial scales (1-10 km at 1 km intervals, however due to strong collinearity we used a maximum distance of 2 km for variable selection) as well as the accessibility of nearby lakes, by taking into account natural and artificial barriers as impassable obstacles in our analyses. We obtained estimates of the maximum morning stream temperature in summer (i.e. annual morning maximum) in each accessible reach from a simple linear model (see Table 3 for details), and calculated the mean within the spatially accessible habitat area. We included a quadratic term (Temp 2 ) to model taxa that prefer intermediate stream temperatures and respond negatively to both low and high temperatures.
Similar to our spatial analysis of stream temperature, we used available data from stream morphology assessments (BAFU, 2006) throughout the river network to quantify the mean morphological state of reaches accessible from the sampled sites. The spatial network analysis was implemented in ESRI's ArcGIS 10.7, with additional post-processing performed using the sf package (Pebesma et al., 2020) in the R statistical computing environment ver. 3.6 (R Core Team, 2020).

Selecting variables based on predictive performance
To select a set of environmental variables from Table 3 that maximize the predictive performance of the presence-absence and zeroinflated model for each taxon, we performed an exhaustive search procedure that combines a best subsets regression analysis (James et al., 2013) with 5-fold cross-validation. During this procedure, we constructed models containing all possible combinations of p parameters (while excluding models with Pearson correlations |r| > 0.7 from further analysis) and applied 5-fold cross-validation to each model for each taxon. The exhaustive search was applied to PAMs with between 1 and 10 potential variables and to ZIMs with between 1 and 4 potential variables. Testing the predictive performance of the ZIMs was limited to 1-4 potential variables due to the count and zero components in the model structure leading to a large number of combinations of variables in the model that quickly became computationally intractable (despite the use of parallelized processing) with additional potential variables. For the variable selection, we used models that do not explicitly account for the observation process to make use of standard R packages that have a short run-time (see next section) during maximum-likelihood estimation. We identified models with the highest predictive performance based on the total log-likelihood during independent predictions over the k-folds, and verified that parameter estimates of the top three performing models were consistently positive or negative over the five folds.

Model implementation and parameter inference
Throughout most of the model development process, we used the R statistical computing environment ver. 3.6 (R Core Team, 2020). During variable selection, the parameters α, β j in the presence-absence models were identified by maximum likelihood estimation with an iterative Table 3 Environmental conditions identified as potential explanatory variables for the presence-absence model and zero-inflated model of each taxon (including the untransformed minimum, mean, and maximum values).

Abbreviations (units) Description
HabitatArea (km 2 ) The total accessible habitat area of reaches within a maximum network distance (2 km) of the site, considering natural and anthropogenic barriers ≥ 50 cm high as impassable. Channel widths for accessible lengths of each reach were obtained from morphological assessments or based on the mean channel width of reaches in the same stream order (min: 0, mean: 0.08, max: 0.47 km 2 ).
Temp (   Note: The environmental variables above do not vary with repeated samplings t i at site i, due to limitations in data availability. For example, stream temperature had to be estimated from spatial variables, i.e. catchment size and mean catchment elevation. Each variable was centered by their mean and normalized by their standard deviation (x k = (x ik − x k )/σ k ) to reduce correlations among the marginal posterior parameter distributions and thereby improve parameter inference, and to maintain parameter estimates relative to the units of the environmental variables. NAWA habitat data for the variables Gravel, Pool, and HidingSpots was combined with equivalent variables in Progetto Fiumi by assuming the ordinal values in NAWA data represent the following proportions of the fished area: none (0%), low (10%), recurrent (30%), recurrent/ frequently (45%), frequently (60%).
weighted least squares algorithm in the glm function in R and using the null model parameters as starting values. If the iterative weighted least squares algorithm produced parameter estimates with a proposed model deviance greater than the null model deviance, we applied a more robust optimization method with the optim function in R to identify the maximum likelihood solution (as in Nelder and Mead 1965). For the zero-inflated models, we used the zero-inflated negative binomial model implemented in the zeroinfl function of the pscl package (Zeileis et al., 2008) in R. The parameters Θ were identified by maximum likelihood estimation using the quasi-Newtonian Broyden--Fletcher-Goldfarb-Shanno algorithm implemented in the optim function in R (Broyden, 1970;Fletcher, 1970;Goldfarb, 1970;Shanno, 1970).
Based on the results of our variable selection, we implemented the selected PAMs and ZIMs (including the observation process) in Stan (Carpenter et al., 2017) and accessed it through the R package rstan (Stan Development Team, 2018). The joint posterior probability distributions of the model parameters were sampled by doing Bayesian inference with an adaptive No-U-Turn Hamiltonian Markov chain Monte Carlo algorithm Duane et al., 1987). The prior distributions of the parameters are provided in Table 4. We used rather wide priors for the parameters α, β, ϕ to reflect for our limited prior knowledge.

Variable selection
The variable selection procedure provided insight into statistical associations between the occurrence and abundance of taxa and environmental variables. During the variable selection procedure of the presence-absence models, for all taxa the top three models in terms of predictive performance each included stream temperature within the accessible habitat (see Table 5). The models for Cottus spp. and Phoxinus spp. include a quadratic term that leads to a slight curvature of the response curve within the relevant temperature range covered in the data (see Section 3.3 and SI Fig. 14). Fish taxa exhibited consistently positive responses to additional spatially-explicit variables such as the accessible habitat area (HabitatArea) and the mean width variability within the accessible habitat area (WidthVar), while two taxa exhibited a negative response to the mean depth variability within the accessible habitat area (DepthVar).
Contrary to our expectation based on the conceptual models, three taxa responded negatively to overhead shelter in fished area (Hiding-Spots) and four taxa responded negatively to the proportion of forest in the catchment. With the exception of Barbatula spp., the taxon-specific responses to agricultural (Farm and LUD) and urban land use were

Table 5
Variables in the three top-ranked presence-absence models based on predictive performance (quantified by the log-likelihood during testing). Individual variables are coloured based on whether the maximum likelihood estimates of the β parameters were positive (blue), negative (red), or inconsistent (black; i.e., positive and negative, depending on the fold) during calibration on the five training datasets. The results for Salmo spp. are not shown, because even the top models did not perform better than the null model.

negative.
The top-performing zero-inflated models of Salmo spp. include a quadratic term for temperature and negative responses to accessible habitat areas, while also revealing consistent positive responses to the proportion of urban land use in the catchment and to the proportion of hiding spots in the fished areas (Table 6). In contrast, the ZIM models of several other fish taxa include a positive response of the predicted counts to the accessible habitat area and/or a negative response to overhead shelter (HidingSpots).
The results of the selected ZIMs for Barbus spp. are not discussed here, due to poor model performance compared to the null-model (see Section 3.2 Model Performance for results and discussion).

Model performance
The selected presence-absence models show a similarly good quality of fit for all taxa (except for the widespread Salmo spp.) due to the high explanatory power of the selected environmental variables in each model (Table 7). Moreover, the performance of the selected models is quite good during cross-validation, with only minor decreases in mean explanatory power D 2 and standardized likelihood ratios L ratio during predictions on out-of-sample testing data.
For the taxa with intermediate prevalence, the explanatory power of the selected environmental variables in the PAM lead to models that predict probabilities of occurrence that closely matched the observations, and were similar during calibration (training) and prediction (testing) of 5-fold cross-validation (Fig. 5).
The geographic distribution of the predicted probabilities of occurrence of the presence-absence models can be used to identify specific sites or regions where the model predictions are consistent with or diverge from the observations (see Fig. 6 for an example for Cottus spp. and section 1.6 in the SI for other taxa).
The zero-inflated models generally performed better than the null model for all taxa except Barbus spp., as indicated by the likelihood ratios above one for testing data during cross-validation Table 8.

Table 6
Environmental variables included in the three top-ranked zero-inflated models based on predictive performance during 5-fold cross-validation. Individual variables are colored based on whether the maximum likelihood estimates of the β parameters of the count component were positive (blue), or negative (red) during calibration on each of the folds. Note that negative β parameters in the zero component have a positive effect on the predicted abundance and vice versa, therefore the color coding for the zero component is inverted.

Table 7
Performance of the presence-absence model for each taxon, including the quality of fit during model calibration using all data and performance during 5-fold cross-validation based on the average D 2 to illustrate the explanatory power and the likelihood ratio L ratio for the testing data over all folds. The prevalence (Prev) refers to the data set used for calibration with the omission of samples with missing environmental variables and can therefore differ from

Taxon-specific responses
The maximum posterior parameter estimates during calibration of the PAMs for each taxon show taxon-specific responses during calibration using all samples (Table 9). All taxa responded positively to mean maximum morning summer stream temperature within the total accessible habitat (Temp) in the relevant range covered in the data set (9-25 • C) (see Figure 14 in the SI for the responses of the taxa that include a quadratic term, i.e. Cottus spp., and Phoxinus spp.).
The maximum posterior parameters of the ZIM models for all taxa for which the model predicted better than the null model are provided in Table 10. In addition, the marginal posterior parameter distributions of the count and zero components illustrate the uncertainty in the parameter estimates (Fig. 7 for Salmo spp. and SI Figures 15-19 for other taxa). For example, for Salmo spp. the marginal posterior estimates do not overlap with zero, indicating significant responses to the selected environmental conditions. This taxon has no variables selected for the zero component. The count component shows the strongest response to the accessible habitat area, followed by hiding spots, temperature  . A good quality of fit is indicated by large blue dots and small red dots.

Table 8
Performance of the zero-inflated model for density of fish taxa, including the quality of fit during model calibration using all data (with the log-likelihood of the proposed model given as logL) and likelihood ratios L ratio during calibration and 5-fold cross-validation. The prevalence refers to the data set used for calibration with the ommission of samples with missing environmental variables and can therefore differ from  Figure 14 in the SI).

Discussion
In this study, we set out to model and predict the occurrence and density observations of riverine fish using natural and anthropogenic environmental conditions in a region with a high variation in topography. The sites are diverse including alpine streams with mainly hydromorphological alterations and lowland rivers in densely populated areas with intensive agriculture. To achieve this aim, we developed mechanistic conceptual models of fish autecology to summarize dominant environmental variables that could be included in a statistical species distribution model. We then developed a presence-absence model and a zero-inflated count model for each fish taxon, incorporating the observation process (i.e., fishing method) and sampling effort (fished area) into the structure of each model. This enabled us to integrate data from different programs and account for the uncertainty of the observation process during calibration.

Overall performance
In selecting environmental variables to include in the PAM and ZIM for each taxon based on a simplified model, our results show that we can explain and predict the occurrences of riverine fish with an intermediate prevalence using a presence-absence model rather well. By contrast, the PAM performs poorly for the highly prevalent Salmo spp. (prevalence of 95%). The dependence of the PAM performance on prevalence is consistent with previous findings (Sor et al., 2017), demonstrating that it is very difficult to predict absence observations for taxa that occur almost everywhere or presence observations for taxa that are very rare with a statistical model that relies on the information content in the data. For this reason, rare taxa are usually excluded from statistical species distribution models (Sor et al., 2017).
The ZIM predicted reasonably well compared to the null model for most taxa (with a likelihood ratio for testing data between 1.1 and 1.6, indicating that the likelihood is 1.1 to 1.6 times larger for the proposed model than for the null model on average for each data point). Barbus spp. was the only taxon with a standardized likelihood ratio below one, indicating that the ZIM model predicts worse than the null model due to overfitting.
The performance of the ZIM may be further improved by additional environmental variables that were unavailable for this study, by increasing the number of data points with non-zero counts for taxa with low prevalence, and/or by extending the variable selection procedure, which was restricted to 1-4 environmental variables due to computational limits (but see next sections).
Initially, we tested the inclusion of random site effects in the ZIM to account for uncertainties in the fish densities in addition to uncertainty by patchiness and the observation process (Fig. 3). However, the absence of prior information about the variance of the site effects and the variance of the negative binomial distribution for counts led to difficulties identifying these two sources of error due to the low information content in the data. We therefore omitted the random site effects from the ZIM.

Taxon-specific responses in the models
The top-ranked PAMs for all taxa included stream summer morning temperature, with consistently positive parameter estimates during cross-validation. The PAMs for Squalius spp., Barbus spp., Phoxinus spp., and Gobio spp. showed clear positive responses to stream temperature that are consistent with existing knowledge of habitat preferences among Cyprinids, while the positive response of Cottus spp. in the PAM contrasts with the expected optimum temperature of 10-11 • C (Fig. SI  3). The inferred optimum summer morning temperature for Salmo spp. around 18 • C (Fig. SI 14) in the ZIM is roughly in agreement with prior knowledge (Jonsson and Jonsson, 2009). Additional spatially-explicit habitat conditions appeared frequently in the top-ranked models, including the mean channel width variability and depth variability Table 9 Maximum posterior parameter estimates of the presence-absence models during calibration using all samples. Positive values (blue) indicate an increase of the probability of occurrence with the corresponding environmental condition and negative values (red) a decrease. Black numbers indicate that the sign was not consistent across the 5-folds during cross-validation. The larger the value the stronger is the response. Results of Salmo spp. are not shown, because the model did not predict better than the null model.
within the accessible habitat. The negative responses of Barbus spp. and Phoxinus spp. to depth variability and concurrent positive responses to width variability should be interpreted with care because the correlation between these two variables was close to the threshold of exclusion (Pearson correlation coefficient of 0.7, see section 1.3 in the SI). The unexpected negative responses among many taxa to hiding spots in the fished area in the PAMs suggests that habitats with undercut banks may provide more shelter for piscivorous fish, which can negatively impact e. g. cyprinid fish, through increased predation pressure (e.g. Walser et al., 1999). This is further supported by the positive response of Salmo spp. abundance to hiding spots in the ZIM. The total habitat area, which is determined by the presence (or rather absence) of migration barriers, had a positive association with the presence and abundance of Cottus spp. and Barbatula spp. and a negative association with Salmo spp. abundance. Since Salmo spp. is subject to stocking in Switzerland (Borsuk et al., 2006), it is plausible that it is more likely to maintain populations in fragmented habitats as compared to other species and that local extinctions of competitor species may lead to higher densities (Holmen et al., 2003;Keeley, 2001). Previous studies also showed that habitat fragmentation can alter fish community structure in streams (e. g. Perkin and Gido, 2012). The models indicated positive associations between urban land use in the catchment and Salmo spp., Cottus spp., Phoxinus spp. abundance and negative associations with the presence of Barbus spp. Agricultural land use indicators had different effects on taxa presence and abundance in the models, while most taxa showed a negative association with the proportion of forest cover in the catchment. For example, the livestock unit densities had a negative association with Phoxinus spp. and opposing effects on Cottus spp. in the PAM and the ZIM. The proportion of arable land (Farm) had a negative association with Cottus spp. and Barbus spp. presence, and positive associations with Squalius spp. and Barbatula spp., the latter of which is expected to be tolerant to organic pollution (SI Fig. 5).
Although potential mechanistic pathways that can lead to these responses are known, their relative importance is uncertain. For example, the negative responses of specific fish taxa to arable land use may be attributable to impaired water quality (e.g. organic matter inputs leading to oxygen depletion), stream morphology (e.g., clogging with fine sediments), or altered stream hydrology, while some taxa may profit from higher stream productivity due to nutrient inputs. To disentangle these effects, more detailed water quality data for the fish monitoring sites would be needed.

Table 10
Maximum posterior parameter estimates of the ZIM for all taxa with good model performance during calibration using all samples. Parameter estimates are colored according to positive (blue) or negative (red) effects on predicted counts, see Table 9. Note that positive β parameter values of the zero component increase the probability of zeros and therefore have a negative effect on modelled probability for presence (and vice versa), therefore the color coding is inverted.

Conceptual and statistical models
The discrepancies between the environmental conditions included in our conceptual and statistical models may arise for two major reasons. First, our conceptual models are more mechanistic than our statistical models because (i) we included environmental influence factors in the conceptual models that have a direct, mechanistic effect on the selected taxa and regardless of data availability, and (ii) for specific taxa such as Salmo spp., Squalius spp., Barbus spp., Barbatula spp., the conceptual models are structured according to major fish growth stages. Despite these differences, we acknowledge that our conceptual models may be incomplete or contain a high degree of uncertainty due to limited knowledge of the effects of specific natural environmental influence factors (e.g., prevalence of disease and parasites) and anthropogenic influences (e.g., stocking and angling) on the development and distribution of specific taxa.
Second, not all environmental variables included in the conceptual model could be included in the statistical model due to the limited availability of data (e.g., water quality parameters). We used the conceptual models to identify potential environmental variables, and selected variables to include in the statistical models (i.e., the PAM and ZIM) for each taxon based on the data availability and predictive performance of candidate models. Differences between the conceptual and statistical model (in terms of the selected variables and taxon-specific responses) may indicate that our prior knowledge in the conceptual model is incorrect or that the explanatory variables are not accurate enough, especially those that were estimated from other factors or are only indirectly linked to stressors, such as land use. For the ZIM, we cannot exclude the possibility that the variable selection procedure was not comprehensive enough, because we tested only models with 1-4 explanatory variables, due to computational limitations. However, given that even the relatively simple selected ZIMs led to overfitting for Barbus spp., this is (at least for Barbus spp.) unlikely to be the cause of the poor predictive performance.

Improving the zero-inflated model
Additional analysis of the ZIM is necessary to further improve predictive performance and to identify more intuitive methods of illustrating model performance. The standardized likelihood ratios of the zero-inflated count models for the testing data during cross-validation below one for Barbus spp. indicate an overfitting of the model, despite the low number of parameters included, and a failure of the model to predict the counts for independent data points. The skewed distribution of observed counts with a long tailing (see Observed Counts in the SI) already indicates that it will be difficult to predict the few observations with high counts with a rather simple statistical model based on environmental conditions, especially if the number of available data points with observations above zero are low. Increasing the sample size of the observational data to increase the information content relating fish densities to environmental conditions may improve model performance. In the future, it might be worth trying to model biomass instead of counts. Fish biomass can be expected to have a more even distribution, because observations of high numbers can be caused by a large number of juveniles with low biomass.
We found that many studies applying ZIMs to fish use information- theoretic statistics (such as BIC and DIC) to evaluate and select a model (e.g., Arab et al., 2012). Instead of penalizing the likelihood of a candidate model based on increasing model complexity (i.e., adding environmental variables to the model), we opted to quantify the performance of the ZIMs using the likelihood during calibration using all samples and 5-fold cross-validation. While the standardized likelihood ratio indicates how much more likely the observations are in the proposed model relative to the null model, it would be valuable to also visualize the results by comparing observations and model predictions. However, since the zero-inflated densities can have a bi-modal distribution (see Fig. 4), it is not straightforward to visualize and summarize model predictions (e.g. based on quantiles) (but see SI Fig. 20 for an attempt to visualize the distributions of the predicted vs. observed counts).
In conclusion, the use of conceptual models for the presence/absence and densities of fish taxa contributed to the pre-selection of environmental variables and the development of statistical models that reflect plausible cause-and-effect relationships. We were able to develop presence-absence models for taxa with intermediate prevalence and a ZIM for the most prevalent taxa that have a reasonable predictive performance, incorporating spatially-explicit environmental variables and making use of monitoring data with different sampling methods and levels of sampling effort to account for the observation process.
We have shown that statistical models can indicate potential positive or negative effects of environmental factors on the occurrence and abundance of common riverine fish taxa. Many of these factors are currently subject to management actions or influenced by global changes. Examples for such management actions are the removal of barriers to fish migration (O'Hanley et al., 2013), the morphological restoration of rivers (Haase et al., 2013), and changes in agricultural practices to reduce pesticide and nutrient inputs (Acero Triana et al., 2021). The models developed in this study are far from able to accurately predict the effect of a specific management action on a local fish community, which would additionally require considerations of the colonization potential, biotic interactions and less common taxa. Still, this study may help to adjust expectations towards management actions under changing environmental conditions. For example, according to our results, increasing stream temperatures can be expected to increase the occurrence of most common fish species, while the removal of barriers and increased morphological variability could lead to a species turn over. With this attempt for "mechanistically inspired" statistical models, we hope to encourage the use of increasingly available fish biomonitoring data to learn about fish responses to multiple stressors in a changing environment.

Credit author statement
BC: Bogdan Caradima AS: Andreas Scheidegger JB: Jakob Brodersen NS: Nele Schuwirth BC lead the data preparation, statistical analysis, and writing. AS contributed to the statistical methodology, model development, and model concept diagrams. JB contributed to data documentation and curation, fish taxonomy, and to the writing with biological interpretations of results. NS contributed to the writing, statistical analysis, and study methodology and design.

Declaration of Competing Interest
The authors have no conflicts of interest to declare in this study.