Atlantic salmon habitat-abundance modeling using machine learning methods



Introduction
Climate change and anthropogenic activities influence fish habitat suitability by changing river temperature (Isaak et al., 2012), vegetation distribution (Ackerly et al., 2015), food availability (Cameron et al., 2019), water quality (Ritson et al., 2014), and hydrological regimes (Jelovica et al., 2022). Since the distribution and abundance of fluvial fish are strongly affected by habitat (Armstrong et al., 2003), it is essential to measure indicators that reflect habitat quality and can support river conservation and improvement (Giorgio et al., 2016). For example, stream depth, substrate, flow, shelter, temperature, and oxygen availability are some of the essential factors that influence salmon abundance at various life stages. Abiotic factors such as riverbed geomorphology, hydrology, water quality, and the aquatic environment are complex and intricately interdependent. Habitat datasets are complex, and the relationships among variables are not necessarily linear, which makes modeling freshwater communities challenging (Armstrong et al., 2003, Mondal and Bhat, 2021). These challenges motivate the application of data-driven tools such as Machine Learning (ML) techniques (Lee et al., 2003, De'ath et al., 2000), which support nonlinear relations.
Machine learning (ML) is a powerful statistical tool for identifying nonlinear relationships in natural phenomena (Naghibi et al., 2016). In ecological studies, ML has been applied to model complex species community composition and abundance (Matsuzawa et al., 2023). For example, Mondal and Bhat (2021) modeled species richness and diversity in eastern and central India using various ML approaches, leveraging abundance and ecological data. Wellman et al. (2020) used machine learning to model the ecology of urban birds and their habitats, whereas Xu et al. (2024) used support vector regression, RF, and extreme gradient boosting to predict phytoplankton biomass from environmental variables. In general, Support Vector Machine (SVM) (Kang et al., 2022, Ahmadi et al., 2021, Fan et al., 2017, Park et al., 2015), Random Forest (RF) (Yang et al., 2020, Guo et al., 2019, Woo et al., 2019), and Gradient Boosting Regression (GBR) (Ficsór and Csabai, 2023, Garcia et al., 2018, Welchowski et al., 2022) have been widely used to solve ecohydrological and environmental problems. SVM, proposed by Vapnik (1998), is one of the most popular ML techniques for describing nonlinear and complex data and has been used for uncertainty analysis (Liu et al., 2013, Singh et al., 2011). SVM has excellent generalization performance and produces competitive results with minimal model tuning (Granata et al., 2017, Hoang et al., 2010). Fan et al. (2017) used SVM to predict the bio-indicators of an aquatic ecosystem in the Taizi River in China, and Kang et al. (2022) applied SVM to estimate the fish assessment index in South Korean rivers. The RF method, in turn, has been used in many studies because of its high accuracy (Ho, 1995, Amit and Geman, 1997, Breiman, 1996). Martínez-Santos et al. (2021) used SVM and RF for aquatic ecosystem mapping. Olaya-Marín et al. (2013) applied RF to model fish richness in the Mediterranean region. RF was an effective approach for assessing stream habitat conditions and demonstrating the seasonal, longitudinal, and local co-occurrence patterns of fish species in the Yagawa River (Matsuzawa et al., 2023). Yang et al. (2020) developed an RF model to show the composition of fish species between two reservoirs in the Yangtze River, China. The GBR model has successfully handled problems with many variables and nonlinear relationships and has shown high prediction accuracy (De'ath and Fabricius, 2000). As an example, Leathwick et al. (2006) analyzed the relationships between demersal fish species richness, environment, and trawl characteristics using GBR. Ficsór and Csabai (2023) applied various machine learning models, such as RF and GB, to predict the distribution of Hydropsyche and explained the impact of environmental factors on the dominant presence of the species.
Although ML models exhibit significant capabilities in addressing challenges such as small dataset sizes, high dimensionality, and nonlinear problem domains (Ding et al., 2011), their performance and accuracy depend strongly on the size and quality of the training dataset. Habitat datasets are often small, scarce, and imbalanced because measurements are costly and laborious. It is therefore challenging to solve problems based on these datasets precisely (Danandeh Mehr et al., 2022, Crisci et al., 2012). To address this issue, we applied ML techniques that are less sensitive to sample size. Since each model has its own advantages, we used multiple models and compared their performance in estimating habitat-abundance relationships in juvenile Atlantic salmon (Salmo salar). To enhance model performance, we applied a grid search algorithm to select the hyperparameters, which control the learning process and the model results. The grid search algorithm was coupled with K-fold cross-validation (Fayed and Atiya, 2019) to optimize the hyperparameters of the models. Cross-validation not only enhances the model's reliability and mitigates overfitting, but also provides a more accurate estimate of the model's generalization performance on the test set (Abobakr Yahya et al., 2019, Danandeh Mehr et al., 2022).
This study aims to model the relationships between the abundance of juvenile Atlantic salmon and their fluvial habitat using datasets from the subarctic Teno catchment in northernmost Scandinavia. Abundance refers to the density of juvenile Atlantic salmon per 100 m² in the studied area. Given the relatively small size of the habitat dataset and the intricate relationship between habitat characteristics and juvenile salmon abundance, we employed a range of ML techniques to model the habitat-abundance relationship of juvenile Atlantic salmon in two distinct age categories: fry (age-0+) and parr (age-1+ and older).

Study area and data
The subarctic Teno River (Tana in Norwegian, Deatnu in Sami) forms the border between northernmost Finland and Norway at 70° N. With a catchment area of 16,386 km², it is one of the largest Atlantic salmon rivers running to the North-East Atlantic Ocean. The mean annual discharge of the river is 177 m³ s⁻¹, with spring floods peaking at 2000-3000 m³ s⁻¹. The mean annual temperature is between ca. 0 and −3 °C, and annual precipitation ranges from ca. 300 to 500 mm (Koster et al., 2005). Atlantic salmon in the Teno River are distributed over more than 1100 km of the main branch and tributaries. The population complex shows extraordinary diversity, with numerous genetically distinct subpopulations (Vähä et al., 2017) and vast variation in life-history strategies (Erkinaro et al., 2019).
Estimating juvenile Atlantic salmon abundance has been part of the long-term monitoring program in the Teno (Niemelä et al., 2005). Permanent monitoring sites (Fig. 1) are distributed along three main branches of the Teno system: the Teno main stem and two of its large tributaries, Inarijoki and Utsjoki. Most of the Inarijoki follows the Finnish-Norwegian border. The Teno main stem starts from the confluence of Inarijoki and another large headwater tributary, Karasjohka. Inarijoki has a length of 153 km, a drainage area of 3,152 km², and an average monthly discharge of 36.4 m³ s⁻¹. Utsjoki is the largest tributary on the Finnish side of the Teno catchment, with a drainage area of 1,652 km² and a mean discharge of 18 m³ s⁻¹.
Habitat data were collected from permanent electrofishing sites (Fig. 1) in the Teno, Inarijoki, and Utsjoki Rivers in two consecutive years from July to October. At each electrofishing site, three points on three transects were selected: two points close to the edges and one in the middle of the site. Each point covered 0.25 m² (0.5 × 0.5 m). The habitat variables, i.e., the habitat characteristics of the electrofishing sites, include water temperature, average depth (cm), average velocity (cm s⁻¹), substrate types, shade types, vegetation, and shelter index. The substrate was categorized into five groups: organic-silt-sand, gravel (2-16 mm), cobble (17-130 mm), stone (131-500 mm), and boulder (greater than 500 mm). The observed shades were mainly boulders and sometimes other structures such as large woody debris. Vegetation includes moss, algae, and other plants. Shelter was estimated by visually identifying all potential interstitial spaces in the substratum. The depth of each space was measured with a flexible PVC tube (13 mm diameter) on which distances of 3, 5, and 10 cm were marked off (cf. Finstad et al., 2007). Spaces deeper than 3 cm (25-100 % of the body length of the fish) were counted as shelter, and three shelter size (depth) groups were identified: 3-5 cm, 5-10 cm, and > 10 cm. Juvenile salmon abundance was defined as the number of fish per 100 m² (single-pass electrofishing, no removal estimates used) in two age groups: fry (age-0+) and parr (age-1+ and older).

Regression and classification models
We employed support vector regression (SVR), RF, and GB as a set of regression techniques, along with support vector classification (SVC) as a classification tool, to model the habitat-abundance relationship of juvenile Atlantic salmon. For the SVC model, fry and parr abundance were grouped into distinct classes based on their abundance histograms (Fig. 2). Classifying abundance reduces the complexity of the model output compared with regression. We identified two and four distinct classes for fry and parr abundance, respectively. Given the limited size of the habitat dataset, which comprises 14 habitat variables and only 114 records, we developed multiple models to assess and compare their performance.
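The conversion of abundance values into classes can be illustrated with a minimal sketch. The class boundaries below are hypothetical placeholders, since the study derives its boundaries from the abundance histograms in Fig. 2, and the function name `abundance_to_class` is introduced here only for illustration.

```python
from bisect import bisect_right

def abundance_to_class(abundance, class_edges):
    """Return the 0-based class index of an abundance value.

    class_edges holds the ascending lower bounds of classes 1, 2, ...;
    values below the first edge fall into class 0.
    """
    return bisect_right(class_edges, abundance)

# Hypothetical class boundaries (fish per 100 m^2), NOT the study's values.
FRY_EDGES = [20.0]               # one edge    -> two classes
PARR_EDGES = [5.0, 15.0, 30.0]   # three edges -> four classes

fry_classes = [abundance_to_class(a, FRY_EDGES) for a in (3.0, 45.0)]
parr_classes = [abundance_to_class(a, PARR_EDGES)
                for a in (0.0, 5.0, 14.9, 29.0, 100.0)]
print(fry_classes, parr_classes)
```

With one edge the target has two classes and with three edges it has four, which mirrors the two-class and four-class setups used for fry and parr.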

Support vector machine (SVM)
The Support Vector Machine technique is based on the dimension theory of Vapnik and Chervonenkis (1971) and provides a robust solution for both regression and classification problems by finding the separating hyperplane with the maximum margin. For a simple two-dimensional plane, the hyperplane is defined as f(x) = ωᵀx + b, where ω is the weight vector normal to the hyperplane and b/‖ω‖ determines the offset of the hyperplane from the origin. In the case of nonlinear relationships, SVM uses the kernel trick, which implicitly projects the data into a high-dimensional space. A detailed explanation of the SVM technique can be found in the works of Abobakr Yahya et al. (2019), Rahimian Boogar et al. (2019), and Auerbach and Fremie (2022).
The performance of SVM relies on the choice of kernel function and of hyperparameters such as gamma and capacity (C). Common kernel functions are the radial basis function (RBF), sigmoid, and polynomial kernels. RBF has been reported to be more efficient than the other kernel functions (Yang et al., 2021) and was therefore chosen in this study for constructing the SVR and SVC models.
The hyperparameters gamma and C play essential roles in the performance of SVM, and prediction accuracy depends on selecting them appropriately. Gamma describes the width of the kernel function and thereby controls the complexity of the model, whereas C controls the trade-off between a wide margin and a small training error. It is crucial to note that choosing too small a value of C may result in underfitting (Abobakr Yahya et al., 2019).
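The role of gamma can be made concrete with a short sketch of the RBF kernel. It assumes the common parameterization k(x, z) = exp(−gamma · ‖x − z‖²); the paper does not state which form was used, and the two habitat records below are made-up values.

```python
import math

def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel, assumed form: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Two hypothetical, already normalized habitat records (illustrative values).
site_a = [0.2, 0.5, 0.1]
site_b = [0.4, 0.1, 0.3]

# A larger gamma makes the kernel decay faster with distance, so each
# training point influences a smaller neighbourhood and the fitted
# function can become more complex.
similarities = {gamma: rbf_kernel(site_a, site_b, gamma)
                for gamma in (0.1, 1.0, 10.0)}
for gamma, value in similarities.items():
    print(f"gamma={gamma}: k(a, b)={value:.4f}")
```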

Random forest (RF)
The Random Forest algorithm, introduced by Breiman (2001), is based on the concept of model aggregation to produce accurate predictions for both regression and classification problems. It is one of the most popular machine learning techniques and is widely employed in environmental studies (Vorpahl et al., 2012, Prasad et al., 2006). The technique learns quickly and can handle multidimensional datasets (Li et al., 2021). A random forest is composed of numerous binary decision trees that use bootstrap samples from the training dataset and a random selection of explanatory features at each node (Amit and Geman, 1997, Ho, 1998, Breiman, 2001). RF uses the bootstrap method to draw random subsets from the original dataset, and a decision tree is trained independently on each subset. The result is obtained by averaging the predictions of all decision trees (Li et al., 2021). The randomness reinforces the model's robustness and improves the learning process by changing from one randomly partitioned inventory subset to another to obtain the patterns of interest (Prasad et al., 2006). The performance of the Random Forest algorithm is influenced by several hyperparameters, including the number of trees in the ensemble (Vorpahl et al., 2012).
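The bootstrap-and-average idea can be sketched in a few lines. This is a deliberately simplified illustration, not the implementation used in the study: each "tree" is a depth-1 stump on one randomly chosen feature, whereas a real random forest grows deeper trees and samples features at every node. All data values and function names are hypothetical.

```python
import random

def fit_stump(rows, targets, feature):
    """Fit a depth-1 regression tree on one feature by least squares."""
    best = None
    values = sorted({row[feature] for row in rows})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for row, y in zip(rows, targets) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, targets) if row[feature] > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # the bootstrap sample has a single distinct value
        mean = sum(targets) / len(targets)
        return feature, float("inf"), mean, mean
    return feature, best[1], best[2], best[3]

def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean

def fit_forest(rows, targets, n_trees, rng):
    forest = []
    for _ in range(n_trees):
        # Bootstrap: sample records with replacement.
        indices = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in indices]
        sample_targets = [targets[i] for i in indices]
        # Random feature selection (simplified to one feature per stump).
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump(sample_rows, sample_targets, feature))
    return forest

def predict_forest(forest, row):
    # Aggregation: average the predictions of all trees.
    return sum(predict_stump(stump, row) for stump in forest) / len(forest)

rng = random.Random(0)
rows = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.2], [0.7, 0.4], [0.8, 0.6], [0.9, 0.1]]
targets = [5.0, 6.0, 7.0, 20.0, 22.0, 21.0]
forest = fit_forest(rows, targets, n_trees=25, rng=rng)
prediction = predict_forest(forest, [0.85, 0.5])
print(round(prediction, 2))
```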

Gradient boosting regression (GBR)
Gradient Boosting is a versatile technique applicable to both regression and classification problems. The technique relies on a set of weak learners, such as decision trees. As a boosting method, it builds the model in stages and combines the weak learners into a single strong ensemble model that optimizes a loss function. Friedman suggested using the negative gradient of the loss function L(y, F(x)) to approximate the loss in a Classification and Regression Tree (CART), where F is an estimate of the function F(x) and {(x₁, y₁), (x₂, y₂), ⋯, (xₙ, yₙ)} is the training dataset (Friedman, 2002, Bühlmann and Hothorn, 2007).
The GBR method has several hyperparameters that affect its performance, such as the number of estimators, the learning rate, the maximum depth, and the minimum number of samples per leaf. A comprehensive description of GBR and its hyperparameters can be found in García Nieto et al. (2021).
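The stagewise construction can be illustrated with a minimal sketch for a single feature. It assumes the squared-error loss L = ½(y − F(x))², for which the negative gradient is simply the residual y − F(x), and it uses depth-1 trees as weak learners. The data and the hyperparameter values are hypothetical and are not those of the study.

```python
def fit_stump(xs, residuals):
    """Least-squares depth-1 regression tree on a single feature."""
    best = None
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_estimators, learning_rate):
    """Return the training predictions after n_estimators boosting stages."""
    initial = sum(ys) / len(ys)            # stage 0: constant model
    predictions = [initial] * len(ys)
    for _ in range(n_estimators):
        # Negative gradient of the squared-error loss = current residuals.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        # Add the shrunken weak learner to the ensemble.
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return predictions

def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

xs = [0.05, 0.15, 0.3, 0.45, 0.6, 0.75, 0.9]
ys = [2.0, 3.0, 9.0, 12.0, 11.0, 5.0, 1.0]
baseline = [sum(ys) / len(ys)] * len(ys)
after_1 = boost(xs, ys, n_estimators=1, learning_rate=0.5)
after_50 = boost(xs, ys, n_estimators=50, learning_rate=0.5)
print(mse(ys, baseline), mse(ys, after_1), mse(ys, after_50))
```

The training error decreases as estimators are added, which is why a very small number of estimators tends to underfit; the error on held-out data need not keep decreasing.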

Hyperparameter optimization
Although machine learning models are powerful in solving small-dataset, high-dimensional, and nonlinear problems (Ding et al., 2011), the selection of hyperparameters affects their performance. Hyperparameter tuning is therefore crucial for finding the combination of hyperparameters that maximizes the model's performance (Fayed and Atiya, 2019).
There are various approaches to tuning hyperparameters, including the grid search algorithm (Fayed and Atiya, 2019), the genetic algorithm (Sanz-Garcia et al., 2015), and the swarm intelligence optimization algorithm (Adachi and Yoshida, 1995). Grid search evaluates every combination of candidate hyperparameter values and is combined here with K-fold cross-validation, which is widely used to assess model parameters.
During K-fold cross-validation, the training dataset is partitioned into K folds. The model is trained in turn on K-1 folds and tested on the fold that was not used during training. This process is repeated for each fold, and the model's score is averaged over the folds. Cross-validation improves the model's reliability, mitigates overfitting, and provides a better estimate of generalization performance on the test set (Abobakr Yahya et al., 2019, Danandeh Mehr et al., 2022).
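The fold construction described above can be sketched as follows. The number of folds is not stated in this section, so K = 5 and the record count of 10 are illustrative assumptions, and the function names are hypothetical. In practice the records would usually be shuffled before the folds are formed.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_records))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def mean_cv_score(score_fold, n_records, k):
    """Average a user-supplied fold score over all k folds."""
    scores = [score_fold(train, validation)
              for train, validation in k_fold_indices(n_records, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(n_records=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```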
Table 1 presents the models and the values considered for hyperparameter optimization. The values and interval boundaries were determined through trial and error, with the aim of covering a broad range of possibilities.
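A grid search over such candidate values amounts to scoring every combination and keeping the best one. The sketch below uses a hypothetical grid for C and gamma (the actual values are those listed in Table 1) and a toy scoring function that stands in for the mean K-fold cross-validation score of a trained model.

```python
from itertools import product

def grid_search(param_grid, mean_cv_score):
    """Evaluate every hyperparameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean_cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values, NOT the grid used in the study.
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]}

def toy_score(params):
    # Placeholder for the mean cross-validation score of a model trained
    # with `params`; constructed to peak at C = 10 and gamma = 0.1.
    return -abs(params["C"] - 10) - abs(params["gamma"] - 0.1)

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the numbers of candidate values, which is one reason to restrict the grid to a range found by trial and error.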

Selection of habitat variables and data preprocessing
We defined several scenarios for the selection of habitat variables incorporated in the models. These scenarios make it possible to determine the importance of the habitat variables for the model results when they are examined collectively (Scenario 1), individually (Scenario 2), or in various groups (Scenarios 3, 4, and 5). Alternative scenarios can be chosen according to the specific modeling objectives to assess their impact on the results. Scenarios 3, 4, and 5 were chosen on the basis of the model results obtained for Scenarios 1 and 2. This selection enables a deeper exploration of the significance of the habitat variables for juvenile salmon abundance. A conceptual map of the study is shown in Fig. 3.
- Scenario 1 includes all habitat variables in the models, i.e., water temperature, mean depth, mean velocity, substrate types (organic-silt-sand, gravel, cobble, stone, boulder), shade types (boulder shade, other shade), vegetation (algae, moss, plants), and shelter index.
- Scenario 2 considers each habitat variable individually. This scenario investigates in particular whether a certain variable has a higher impact on juvenile salmon abundance.
- Scenario 3 considers only substrate variables.
- Scenario 4 considers only shade and plants.
- Scenario 5 explores a combination of various substrates, shades, and vegetation (a combination of scenarios 3 and 4).
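The scenarios can be expressed as lists of feature subsets, each of which is passed to the models in turn. The sketch below uses hypothetical column names, and it assumes that scenario 4 comprises the shade and vegetation variables, which is an interpretation of "shade and plants" that makes scenario 5 the union of scenarios 3 and 4.

```python
# Hypothetical column names for the habitat variables.
SUBSTRATE = ["organic_silt_sand", "gravel", "cobble", "stone", "boulder"]
SHADE = ["boulder_shade", "other_shade"]
VEGETATION = ["algae", "moss", "plants"]
ALL_VARIABLES = (["water_temperature", "mean_depth", "mean_velocity"]
                 + SUBSTRATE + SHADE + VEGETATION + ["shelter_index"])

# Each scenario maps to the list of feature subsets to be modeled.
scenarios = {
    "scenario_1": [ALL_VARIABLES],
    "scenario_2": [[name] for name in ALL_VARIABLES],  # one model per variable
    "scenario_3": [SUBSTRATE],
    "scenario_4": [SHADE + VEGETATION],  # assumption: "plants" = vegetation
    "scenario_5": [SUBSTRATE + SHADE + VEGETATION],
}

for name, subsets in scenarios.items():
    print(name, len(subsets), "feature subset(s)")
```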
Prior to model training, the dataset was divided into training and test sets with a split ratio of 20 %. Because the quality of the dataset affects the model output, it is important to handle outliers and missing values (Ro et al., 2015). Since the number of records in the dataset is small, we excluded only the outliers with water temperatures below 10 °C.
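A random hold-out split can be sketched as follows. The sketch assumes that the 20 % refers to the share of records reserved for testing and uses the 114 records quoted earlier; the number of records remaining after outlier removal may differ, and the seed and function name are arbitrary choices for illustration.

```python
import random

def train_test_split_indices(n_records, test_fraction, seed):
    """Shuffle record indices and reserve a fraction of them for testing."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_fraction)
    return indices[n_test:], indices[:n_test]

# Assumption: 20 % of the 114 records are held out as the test set.
train_idx, test_idx = train_test_split_indices(114, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))
```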
A conventional strategy for dealing with missing data is to remove entire rows or columns that contain missing values. However, this comes at the price of potentially losing valuable data. A more effective strategy is to infer the missing values from the known part of the data using a suitable imputation technique. In this study, we used the mean strategy to impute missing numeric values.
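Mean imputation replaces each missing value with the mean of the observed values of the same variable. A minimal sketch, in which missing values are represented by `None` and the depth values are made up:

```python
def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [value for value in column if value is not None]
    mean = sum(observed) / len(observed)
    return [mean if value is None else value for value in column]

# Hypothetical mean depths (cm) with one missing measurement.
depth_cm = [20.0, None, 30.0, 40.0]
print(impute_mean(depth_cm))
```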
Since the dataset values span different ranges, normalization is necessary to ensure optimal training speed and accurate results. We employed the min-max normalization method, which maps all data to the range between 0 and 1 using equation (1) (Yang et al., 2021):
\[ x_{nor} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (1) \]
where x is the original data, x_nor is the normalized data, and x_max and x_min are the maximum and minimum values of the data, respectively.
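The min-max mapping can be sketched in a few lines. The example values are hypothetical apart from the end points 2 and 78 cm s⁻¹, which correspond to the range of mean velocities reported for the study sites; the handling of a constant column is an added safeguard, not part of the cited method.

```python
def min_max_normalize(values):
    """Map values linearly to [0, 1]: (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Safeguard for a constant column (assumption: map everything to 0).
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]

# Hypothetical mean velocities (cm/s) spanning the reported range 2 to 78.
velocity = [2.0, 21.0, 40.0, 78.0]
print(min_max_normalize(velocity))
```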

Habitat data
The linear relationships among habitat variables and juvenile salmon abundance were evaluated using the Pearson Correlation Coefficient (PCC) (Fig. 4), which ranges from −1 to 1, with 1 indicating a perfect positive correlation. The PCC revealed no significant correlations between fry abundance and the habitat variables. The highest PCC value, 0.44, is observed between fry and parr abundance.
Among the habitat variables, the shelter index, algae, boulder shade, stone, and organic substrate exhibit higher PCC values with parr abundance. Notably, the shelter index shows the highest PCC value with parr abundance (0.48). Overall, however, the correlations between the habitat variables and parr abundance are not found to be statistically significant.
Among the habitat variables themselves, a notably high correlation of 0.9 is observed between stone and cobble. Boulder shade exhibits a relatively strong correlation of 0.6 with stone, and both other shade and algae show a PCC of 0.6, which is high compared with the remaining habitat variables.
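For reference, the PCC between two variables can be computed directly from its definition, i.e., the covariance divided by the product of the standard deviations. The sketch below uses made-up values and a hypothetical function name; it does not reproduce the correlations in Fig. 4.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up example: a perfectly linear increasing and decreasing relationship.
shelter_index = [1.0, 2.0, 3.0, 4.0]
increasing = [2.0, 4.0, 6.0, 8.0]
decreasing = [8.0, 6.0, 4.0, 2.0]
print(pearson(shelter_index, increasing), pearson(shelter_index, decreasing))
```

Because the PCC captures only linear association, a low value does not rule out a nonlinear relationship between a habitat variable and abundance.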

Juvenile salmon abundance with respect to the depth and water velocity
The distributions of mean water velocity and mean depth with respect to juvenile salmon abundance are shown in Fig. 5a and b, respectively. The fry and parr abundance graphs present the total abundance estimated across all electrofishing sites for the two years. Fry and parr are considerably more abundant at velocities between 13 and 56 cm s⁻¹ than in the other velocity ranges, and their abundance is lowest at velocities below 13 and above 56 cm s⁻¹. Fry abundance increases considerably at velocities above 13 cm s⁻¹, peaks between 24 and 34 cm s⁻¹, and drops at velocities exceeding 35 cm s⁻¹. Parr abundance, on the other hand, is highest at velocities between 13 and 23 cm s⁻¹ and declines at velocities beyond 24 cm s⁻¹. The mean velocity ranges between 2 and 78 cm s⁻¹, and 96 % of the studied area has a mean velocity below 56 cm s⁻¹.
From the mean depth distribution and abundance graphs (Fig. 5b), we observed that fry abundance peaks at mean depths below 29 cm, which constitute about 40 % of the studied area, and decreases at depths beyond 29 cm. Parr abundance gradually increases across the depth ranges below 37 cm, is highest at depths between 29 and 37 cm, and declines considerably at depths exceeding 37 cm. Approximately 70 % of the studied area has mean depths between 12 and 37 cm. Both fry and parr have their lowest abundance at depths above 45 cm.

Regression modeling
The SVR, RF, and GB regression techniques were evaluated using the coefficient of determination (R²). The optimized parameter values and the corresponding R² scores for the validation and test sets are presented in Table 2 for fry and Table 3 for parr. Note that all individual habitat variables (scenario 2), except the shelter index, resulted in negative R² values for both the validation and test sets and are therefore not reported.
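A negative R² means that the model predicts worse than a constant equal to the mean of the observations. This follows from the usual definition R² = 1 − SS_res / SS_tot, which is assumed here and sketched below with made-up abundance values.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

observed = [10.0, 20.0, 30.0]                       # made-up abundances
perfect = r_squared(observed, [10.0, 20.0, 30.0])   # exact predictions
mean_only = r_squared(observed, [20.0, 20.0, 20.0]) # always predict the mean
reversed_trend = r_squared(observed, [30.0, 20.0, 10.0])  # wrong direction
print(perfect, mean_only, reversed_trend)
```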
For fry abundance, SVR achieves the highest mean cross-validation score of 0.58 in scenario 5, which involves substrates, shades, and vegetation. In terms of the test score, SVR performs best in scenario 3 (R² = 0.28), which includes only the substrates. Both RF and GB, on the other hand, attain their highest mean cross-validation scores (R² = 0.33 and R² = 0.44, respectively) with the shelter index (scenario 2), while achieving their highest test scores (R² = 0.46 and R² = 0.49, respectively) for substrates (scenario 3). Comparing all models and scenarios, SVR has the highest mean cross-validation score (R² = 0.58), indicating a better fit to the validation data. However, GBR has the highest test score (R² = 0.49), suggesting superior generalization performance on unseen data.
The grid search heatmaps for fry habitat-abundance modeling (Fig. 6) illustrate the influence of various combinations of hyperparameters on the mean cross-validation scores (R²). Darker colors indicate higher scores. The heatmaps focus exclusively on the models that attained the highest mean cross-validation scores across all scenarios, shown in bold font in Table 2. A subset of hyperparameter values was selected for each heatmap so that the effect of hyperparameter tuning on model performance can be presented clearly.
The SVR models exhibit varying performance with different values of C and gamma. The heatmaps for random forest and gradient boosting (Fig. 6c and d, respectively) are presented for the shelter index in scenario 2, where these models achieve their highest mean cross-validation scores (Table 2). The random forest performs best at a maximum depth of 2, and its performance remains consistent for larger maximum depth values.
In the case of gradient boosting, performance gradually improves as the number of estimators increases, reaching its optimum at 90 estimators and a maximum depth of 2. Model performance is relatively stable when the number of estimators ranges between 50 and 90 and the maximum depth is greater than 2. Performance is notably low when the number of estimators is set to 1, regardless of the maximum depth.
The grid search heatmaps (Fig. 7) illustrate the influence of varying combinations of hyperparameters on the mean cross-validation scores (R²) in parr habitat-abundance modeling. These heatmaps focus exclusively on the models that attained the highest mean cross-validation scores across all scenarios, shown in bold font in Table 3. A subset of hyperparameter values was selected for each heatmap so that the effect of hyperparameter tuning on model performance can be presented clearly. The performance of the SVR models in scenarios 3 and 4 (Fig. 7a and b) is notably low for small C values (e.g., 1) across the suggested gamma values. In scenario 3 (Fig. 7a), model performance improves as both C and gamma increase, reaching its optimum at C = 36 and gamma = 225, and is poor for small values of gamma (e.g., 1). In scenario 4 (Fig. 7b), small values of C and gamma (e.g., 1) likewise result in low model performance. Performance improves as C and gamma increase, is best at C = 257 and gamma = 226, and remains consistent thereafter.
The random forest model in scenario 4 (Fig. 7c) performs poorly for small values of maximum depth (e.g., 1) and improves as the maximum depth increases, reaching its optimum at 13. Model performance is not sensitive to larger values of maximum depth (e.g., greater than 13). For gradient boosting in scenario 1 (Fig. 7d), performance is suboptimal when the number of estimators is set to 1, but it gradually improves as the number of estimators increases, reaching its optimum at 900 estimators and a maximum depth of 2.

Classification modeling
The SVC model was applied to predict fry and parr abundance based on the classes defined from their abundance histograms (Fig. 2). The mean cross-validation and test scores represent the accuracy of the model. As explained in section 2.2, two classes for fry and four classes for parr are considered in the SVC modeling. The mean cross-validation and test scores obtained for the classification models (Table 4 and Table 5) are notably higher than those of the regression models across all scenarios. Among the scenarios for fry (Table 4), the model achieves the highest accuracy with the shelter index, with mean cross-validation and test scores of 91 % and 96 %, respectively. Notably, the test scores remain consistent across all scenarios, except for the substrate types organic, silt, and sand, which exhibit slight variations in performance.
In the parr habitat-abundance modeling, scenario 5, which involves substrate, shade, and vegetation variables, yields the highest cross-validation accuracy. Conversely, scenario 2 exhibits lower mean cross-validation scores than the other scenarios. Within scenario 2, most individual habitat variables display similar mean cross-validation accuracies, except for mean depth, stone, and shelter index, which have higher mean cross-validation scores. Across all scenarios, the test scores remain consistent, except for algae and moss, which show lower accuracies.

Discussion
In our study in the subarctic Teno catchment, we employed several machine learning (ML) techniques to model the habitat-abundance relationships of juvenile Atlantic salmon. The performance and accuracy of ML techniques can be greatly influenced by the size of the dataset. Since collecting long-term habitat data is often costly and laborious, these datasets tend to be scarce and imbalanced (Niemelä et al., 2005, Danandeh Mehr et al., 2022, Matsuzawa et al., 2023). To address these issues, we used ML models that are less sensitive to sample size and have been shown to cope with complexity and uncertainty in the data (Davoudi Moghaddam et al., 2020, Liu et al., 2013, Singh et al., 2011, Guisan and Thuiller, 2005). Furthermore, we performed feature scaling (normalization), K-fold cross-validation, and hyperparameter optimization to improve the performance of the models. Despite these efforts, the R² scores obtained from the regression models are not notably high, which could be due to the small sample size, uncertainties in the habitat data, and the complex nature of the species (Matsuzawa et al., 2023). ML techniques rely on data and perform better with large datasets (Chapelle et al., 2002). However, most ecological datasets are small unless they are part of large-scale projects (Mondal and Bhat, 2021). Since data from long-term monitoring may help to improve the performance of the models, we suggest collecting more local habitat data or combining international datasets from similar river systems to increase the number of records available for modeling.
Grid search heatmaps visually depict how different combinations of hyperparameters affect the performance of the regression models in our results (see Fig. 7). In the SVR model, small C values correspond to a wider margin for the decision boundary, which may result in suboptimal performance. Increasing C typically leads to a narrower margin, which can result in better performance. Gamma controls the curvature of the Gaussian kernel function. Small values of gamma correspond to a smoother decision boundary, so that a broader set of data points is taken into account in the calculations. Larger values of gamma can lead to a more complex decision boundary, which might improve performance where the relationship between input and output variables is highly nonlinear. Very large gamma values, however, lead to overfitting (Kalita et al., 2023). It is therefore necessary to find optimal hyperparameters, e.g., using grid search with cross-validation, to avoid overfitting (Fayed and Atiya, 2019). In all our SVR models, small values of C and gamma (e.g., 1) resulted in poor performance, which points to a complex nonlinear relationship between habitat variables and salmon abundance. The performance of the SVR models improves as C and gamma increase, up to the values at which the best performance is reached. In the RF and GB approaches, the depth of a decision tree is the length of the longest path from its root to a leaf. A deeper tree allows more splits, which permits the trees to describe more variation in the dataset (Breiman, 2001). In the presented heatmaps for the RF and GB models, a small maximum depth (i.e., 1) leads to low model performance, whereas larger values do not necessarily improve the models. The R² values of the RF models are the lowest among all models across all scenarios, except in scenario 2 (shelter index) of fry modeling. In GB, each additional estimator adds to model complexity and may improve performance. A low number of estimators results in suboptimal performance and underfitting (Sagi and Rokach, 2018), which we observed in the GB heatmaps, whereas a larger number of estimators improves the model (Callens et al., 2020).
Feature selection is a helpful pre-processing strategy for building a simpler model and improving performance (Li et al., 2017). In the regression models, we observed that selecting specific habitat variables, such as substrates, shades, and vegetation, leads to improved performance. The R² values obtained for these variables in scenarios 3, 4, and 5 are considerably higher than in scenario 2, where the models consider only individual habitat variables and were unable to establish meaningful relationships between each habitat variable and juvenile salmon abundance. Our results indicate that a combination of habitat conditions, i.e., a selection of certain features such as substrates, shades, and vegetation, is more effective than individual variables such as water depth or water velocity, which have been reported to affect juvenile salmon in previous studies (e.g., Mäki-Petäys et al., 2004, Binns and Eiserman, 1979, Heggenes, 1990). The inconsistencies between the results of scenario 2 and existing studies may be due to limited robustness of the data. If the data are scarce or do not cover the various aspects of the task, learning may fall short, which affects the performance of ML (Mosavi et al., 2018).
To gain better insight into the individual habitat variables, such as water depth and water velocity, for which the models exhibited low performance, we specifically looked at the mean velocity and mean depth distributions with respect to salmon abundance in all study sites (Fig. 5). A study conducted by Mäki-Petäys et al. (2002) in the Teno River reported that fry prefer velocities near zero and below 20 cm s⁻¹, while parr exhibit a preference for velocities ranging between 35 and 80 cm s⁻¹. Our results (Fig. 5a) reveal some deviations in the abundance patterns of fry and parr compared to the outcomes reported by Mäki-Petäys et al. (2002). For fry, our results indicated significantly lower abundance at near-zero velocities and the highest abundance at velocities between 24 and 34 cm s⁻¹. For parr, the abundance peaked between 13 and 23 cm s⁻¹ and declined as velocities increased beyond this range. On the other hand, we observed that the abundance of both fry and parr across various depth ranges aligns with the measurements reported by Mäki-Petäys et al. (2002), where the optimal depths for fry and parr are 5 to 25 cm and 5 to 35 cm, respectively. Nevertheless, approximately 40% of fry abundance was found at depths greater than 29 cm, which is outside the optimum reported by Mäki-Petäys et al. (2002). These deviations from the expected preferences in mean velocity and mean depth in our dataset may result from environmental factors such as local habitat, competition, or behavioral adaptation of juvenile salmon in the specific study area (Mäki-Petäys et al., 2002, Rosenfeld et al., 2005), or from data quality (Mosavi et al., 2018). The data can be enriched through invariance assessment to obtain the group characteristic (Tsai and Yang, 2012) or by using causally dependent coefficients for handling missing values (Sivapalan et al., 2005). These approaches may enhance the robustness of the data, which could impact model development and performance. In addition, we recommend exploring other advanced ML techniques, such as deep learning, as in some cases they have proven useful in habitat-abundance modeling (Ditria et al., 2020).
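The velocity and depth comparison discussed above amounts to asking what share of total abundance falls within each range of a habitat variable. A minimal sketch of that calculation follows; the function name, bin edges, and records are hypothetical illustrations and do not reproduce the study's data or its exact binning.

```python
from collections import defaultdict


def abundance_share_by_bin(records, bin_edges):
    """Fraction of binned abundance in each half-open [lo, hi) range.

    records: iterable of (habitat_value, abundance) pairs.
    Values outside all bins are ignored.
    """
    totals = defaultdict(float)
    for value, abundance in records:
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            if lo <= value < hi:
                totals[(lo, hi)] += abundance
                break
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {b: t / grand_total for b, t in sorted(totals.items())}


# Hypothetical (mean velocity in cm/s, fry abundance per 100 m^2) pairs.
records = [(0.5, 1.0), (10.0, 2.0), (15.0, 3.0),
           (28.0, 6.0), (30.0, 6.0), (40.0, 2.0)]
bin_edges = [0, 13, 24, 35, 80]  # assumed edges, for illustration only

shares = abundance_share_by_bin(records, bin_edges)
peak_bin = max(shares, key=shares.get)
```

The same function can be applied to mean depth by passing depth values and depth bin edges.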
In this research, to decrease the complexity of the regression models and improve performance, we proposed a practical approach that employs the support vector classification technique to classify fry and parr abundance based on their abundance histograms. Transforming a regression problem into a classification problem could improve model performance (Salman and Kecman, 2012, Torgo and Gama, 1996). While the classification models do not predict abundance values as regression models do, they offer a useful approximation of juvenile salmon abundance within the defined classes. The fry abundance classification model demonstrates high accuracies on both the mean cross-validation and test sets across all scenarios. However, the accuracies of the SVC models for parr abundance are comparatively lower. This can be attributed to the increased complexity of the model, which considers four distinct classes instead of the two used in the SVC modeling for fry. While classification models may not capture the full complexity of habitat data, they can still offer valuable insights for understanding and managing juvenile salmon abundance.
Atlantic salmon is a critical natural resource and cultural element in Northern Europe (Landauer et al., 2023), supporting local societies, ecosystems, and ecosystem services. In addition, salmon has a special role in the culture of the indigenous Sami people in the Teno River catchment (Hiedanpää et al., 2020). Currently, Atlantic salmon stocks in the North Atlantic area are declining (ICES, 2023), and more information is urgently needed to support environmental policy and management decisions for conservation and restoration (e.g., Lennox et al., 2021). This includes an improved understanding of habitat-abundance relationships for different life stages of Atlantic salmon in riverine conditions. Our study demonstrated that small datasets and the presence of complex non-linear relationships between habitat and juvenile salmon abundance could impact the models' performance and reliability. These factors can pose challenges in capturing the relationship between species abundance and habitat conditions and can limit the effectiveness of ML techniques (Danandeh Mehr et al., 2022; Guo et al., 2015; McPherson and Jetz, 2007). The availability of data and the characteristics of specific ecological communities could influence the selection and development of models (Mondal and Bhat, 2021). In future studies, it is crucial to consider factors such as data quality and characteristics such as temporal variations, as they introduce uncertainties in both the dataset and the modeling process. We recommend considering uncertainty analysis, as it is an effective approach to understanding the degree of variability associated with models and the level of confidence in predictions (Lin et al., 2015). Improved ML-based models could help to identify critical locations in riverine conditions and, combined with process-based hydrodynamic modelling, habitat-abundance estimation could even be done in real time under varying conditions in the riverine habitat. However, before reaching that analytical level and using ML results to support decision making, it is important to identify and quantify the sources of uncertainty, as they contribute to the variance of ecological predictions (Buisson et al., 2010). Therefore, a comprehensive examination of these factors will enhance the accuracy and robustness of modeling. In addition, machine learning methods often struggle to precisely model habitat data, as noted by Crisci et al. (2012). Thus, it is crucial to exercise caution when utilizing these models.
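As one possible illustration of the uncertainty analysis recommended above, the sketch below computes a percentile bootstrap interval for R² on a small set of observed and predicted abundances. The bootstrap is an assumption chosen for illustration and is not the method used in this study; the function names and all values are hypothetical.

```python
import random


def r2(observed, predicted):
    """Coefficient of determination; None if the observations have no variance."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        return None  # undefined when all resampled observations are equal
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot


def bootstrap_r2_interval(observed, predicted, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for R^2, resampling sites with replacement."""
    rng = random.Random(seed)
    n = len(observed)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        score = r2([observed[i] for i in idx], [predicted[i] for i in idx])
        if score is not None:
            scores.append(score)
    scores.sort()
    lower = scores[int((alpha / 2) * len(scores))]
    upper = scores[int((1 - alpha / 2) * len(scores)) - 1]
    return lower, upper


# Hypothetical observed and predicted abundances for ten sites.
observed = [1.0, 3.0, 2.0, 8.0, 5.0, 9.0, 4.0, 7.0, 6.0, 10.0]
predicted = [2.0, 2.0, 3.0, 7.0, 5.0, 8.0, 5.0, 6.0, 7.0, 9.0]

point_estimate = r2(observed, predicted)
lower, upper = bootstrap_r2_interval(observed, predicted)
```

With few sites the interval tends to be wide, which is one concrete way to see how a small dataset limits confidence in a reported score.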

Conclusion
Recognizing the complex and non-linear nature of habitat-abundance modeling, we employed ML techniques to model juvenile salmon abundance in the Teno catchment, using a relatively small habitat dataset. A comparison between regression and classification models revealed that the relationship between the habitat dataset and juvenile salmon abundance is indeed intricate. Consequently, the regression techniques struggled to fit suitable models to the data, despite their inherent ability to handle complex non-linear relationships. Among the regression models, those incorporating substrates, shades, and vegetation demonstrated higher levels of accuracy. Notably, the support vector classification model outperformed the regression techniques in terms of modeling accuracy. Among the regression models, SVR demonstrated the highest performance.
This study provides insights into the challenges and potential of machine learning techniques for juvenile salmon habitat-abundance modeling in complex habitat environments. The findings emphasize the importance of considering the limitations of machine learning models, particularly in habitat contexts, and the need for further research to address temporal variations and improve the precision of habitat-abundance modeling. Such advancements will aid in the development of robust and reliable tools for fisheries management and conservation strategies, facilitating the sustainable management of Atlantic salmon populations and their habitats.

Declaration of Generative AI and AI-assisted technologies in the writing process
During the preparation of this work, Bähar Jelovica used ChatGPT in order to improve the language and fluency of the text. After using this tool/service, Bähar Jelovica reviewed and edited the content as needed and takes full responsibility for the content of the publication.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1. Teno catchment and sampling sites in the Teno River, located on the border between Finland and Norway.

Fig. 2. The classification of fry and parr abundance based on their abundance histogram per 100 m² for the SVC model. This approach discerns two distinct classes for fry abundance and four distinct classes for parr abundance.
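The class assignment summarized in Fig. 2 can be sketched as a threshold lookup that turns a continuous abundance into a class label before a classifier such as SVC is trained. The cut points below are hypothetical placeholders; the study derives its class boundaries from the abundance histograms, and those values are not reproduced here.

```python
from bisect import bisect_right


def abundance_to_class(abundance, thresholds):
    """Map an abundance to a class index using ascending thresholds.

    With k thresholds there are k + 1 classes, numbered 0..k.
    A value equal to a threshold falls into the higher class.
    """
    return bisect_right(thresholds, abundance)


# Hypothetical cut points (individuals per 100 m^2), for illustration only.
FRY_THRESHOLDS = [10.0]              # one cut point -> two classes
PARR_THRESHOLDS = [5.0, 15.0, 30.0]  # three cut points -> four classes

fry_classes = [abundance_to_class(a, FRY_THRESHOLDS)
               for a in [0.0, 9.9, 10.0, 42.0]]
parr_classes = [abundance_to_class(a, PARR_THRESHOLDS)
                for a in [0.0, 5.0, 20.0, 31.0]]
```

The resulting labels replace the continuous abundance as the model target, which is what turns the regression task into a classification task.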

Fig. 3. Conceptual map of this study, including statistical analysis of the habitat variables, various scenarios to select habitat variables for modeling, data preprocessing, hyperparameter optimization and cross-validation embedded in the regression and classification modeling, and model evaluation.

Fig. 4. Pearson correlation coefficients among habitat data and juvenile salmon abundance. Values range from 0 to 1, with 1 indicating a strong positive correlation.

Fig. 5. (a) Mean velocity distribution and (b) mean depth distribution with respect to fry and parr abundance, estimated across all electrofishing sites over two years.

Figure A2. Substrate type distributions and juvenile salmon abundance per sampling site (a: Fry abundance, b: Parr abundance).

Table 1
Hyperparameter values employed in the grid search algorithm during cross-validation.

Table 2
SVM, RF, GBR modeling results for fry with R² as a performance metric (LR: Learning-Rate, MD: Max-Depth, MSL: Min-Samples-Leaf, NE: n-estimators). Bolded numbers indicate the highest mean cross-validation/test scores per model and scenario.

Table 3
SVM, RF, GBR modeling results for parr with R² as a performance metric (LR: Learning-Rate, MD: Max-Depth, MSL: Min-Samples-Leaf, NE: n-estimators, RS: Random-State). Bolded numbers indicate the highest mean cross-validation/test scores per model and scenario.

Table 4
SVC results for modeling fry abundance. Bolded numbers indicate the highest mean cross-validation/test scores per model and scenario.