CNN for image-based sediment detection applied to a large terrestrial and airborne dataset

: Image-based grain sizing has been used to measure grain size more efficiently compared to traditional methods (e.g. sieving and Wolman pebble count). However, current methods to automatically detect individual grains (e.g. BASEGRAIN(Detert and Weitbrecht, 2012)) are largely based on detecting grain interstices from image intensity which not 10 only require a significant level of expertise for parameter tuning but also underperform when they are applied to sub-optimal environments (e.g. dense organic debris, various sediment lithology). We proposed a model (GrainID) based on convolutional neural networks to measure grain size in a diverse range of fluvial environments. A data set of more than 125,000 grains from flume and field measurements were compiled to develop GrainID. Tests were performed to compare the predictive ability of GrainID with sieving, manual labeling, Wolman pebble counts (Wolman, 1954) and BASEGRAIN (Detert and Weitbrecht, 15 2012). When compared with the sieving results for a sandy-gravel bed, GrainID yielded high predictive accuracy (comparable to the performance of manual labeling) and outperformed BASEGRAIN and Wolman Pebble counts (especially for small grains). For the entire evaluation dataset, GrainID once again showed fewer predictive errors and significantly lower variation in results in comparison to BASEGRAIN and Wolman pebble counts and maintained this advantage even in uncalibrated rivers with drone images. Moreover, the existence of vegetation and noise have little influence on the performance of GrainID. 20 Analysis indicated that GrainID performed optimally when the image resolution is higher than 1.8 mm/pixel, the image tile size is 512*512 pixels512*512 pixel*pixel and the grain area truncation values (the area of smallest detectable grains) were equal to 18 - 25 pixels


Introduction
Sediment grain size and its spatial variability are fundamental in river dynamics (e.g.sediment transport, channel evolution), ecological studies (e.g.aquatic habitat; fishery) and river restoration engineering.However, the measurement of grain size has been time consuming and laborious especially in mountain rivers due to the wide range of grain size classes, diverse grain lithology, the hiding of grains, diverse structures and the influence of organic materials.The most widely used grain-sizing method is sieving (Kellerhals and Bray, 1971) and is used as a benchmark to other methods when reliable sediment samples are able to be collected (Church et al., 1987).Wolman (1954) proposed a pebble count method (Wolman method) that samples a minimum of 100 pebbles from the riverbed surface with a grid-based system.Limited to material > 8 mm (Kellerhals and Bray, 1971), the Wolman method has been especially popular in the field due to the limited equipment required and its benefit of reducing sampling times whilst providing a relatively valid estimation of reach-scale grain-size distribution.Since then, various versions of the Wolman method were proposed with different approaches to collecting stones such as the random walk approach for particle collection (Leopold, 1970), superimposing gravel templates upon the sedimentological unit for reduced operator error (Bunte and Abt, 2001), and image-based Wolman method analysis (Hassan et al., 2020;An et al., 2021).
Since the 1970s, advances in high resolution photography have provided scientists the opportunity to estimate sediment grain size in river beds from images, largely reducing sampling time for large-scale field surveys compared to sieving and Wolman methods (Church et al., 1987;Adams, 1979).However, the development of such methods to measure grain size from images has been challenging as early studies relied on the laborious manual identification of grain boundaries on vertical images (Adams, 1979;Ibbeken and Schleyer, 1986) and only within the last 20 years has there been the development of automated grain sizing algorithms (Graham et al., 2005b;Buscombe et al., 2010;Rubin, 2004).Generally, image-based automated grain sizing methods can be classified from percentile-based to object-based methods (Buscombe, 2020).Percentile-based methods (Carbonneau et al., 2004;Rubin, 2004;Buscombe, 2020;Buscombe et al., 2010) estimate grain size distribution based on statistical analysis of image intensity and texture through pixel-wise simple autocorrelation algorithms (Rubin, 2004), grain size prediction as a function of both local image texture and semi variance (Carbonneau et al., 2004), spectral decomposition of an image (Buscombe et al., 2010) and convolutional neural networks (CNN) (Buscombe, 2020;Mueller, 2019;Lang et al., 2021).Object-based methods (Sime and Ferguson, 2003;Detert and Weitbrecht, 2012;Graham et al., 2005a;Graham et al., 2005b;Mcewan et al., 2000) apply sequences of grain separation algorithms to detect grain interstices and identify each individual grain on the bed.Mcewan et al. (2000) applied an automatic edge-detection algorithm on Digital Elevation Models (DEMs) of grain surfaces generated by laser scanning and reported promising grain-size measuring results.Sime and Ferguson (2003) presented a modified edge-detection algorithm which combined both edges seeding and partial watershed segmentation algorithms.Graham et al. (2005aGraham et al. ( , 2005b) ) proposed a double threshold interstice-detection approach in which the threshold levels to detect grain interstices were initially defined based on image intensity distribution and further refined through a bottom-hat filter.Based upon this approach, Detert and Weitbrecht (2012) proposed an enhanced grain detecting model (named as BASEGRAIN) which applies a five-step image-processing procedure to separate grains on the bed.
As noted by several researchers (e.g., Carbonneau et al. (2004), Graham et al. (2010) and Buscombe (2020)), object-based methods require sophisticated object segmentation algorithms and theoretically cannot be used on grains smaller than one pixel, however, object-based methods can provide grain-scale information on spatial variability which is essential in not only predicting but also understanding the processes of flow resistance (Chen et al., 2020), sediment transport (Yager et al., 2018) and aquatic habitat (Reid et al., 2020).The BASEGRAIN model developed by ETH Zurich is a state-of-art object-based grain sizing software, but it requires extensive parameter tuning (the model contains more than 40 adjustable parameters and seven key parameters) and a significant level of expertise to be applied to sub-optimally captured images.Moreover, the model only focuses on detecting edges and as such performs poorly in fluvial environments where dense organic debris, various sediment lithology, and non-uniform lighting are present in the photo (Detert and Weitbrecht, 2020).The limitations of BASEGRAIN in these suboptimal environmental conditions can be overcome using Convolutional Neural Networks (CNN) which have been extensively used in computer vision (Krizhevsky et al., 2012) and biomedical applications (Ronneberger et al., 2015).Through repeated convolutions and pooling on the input images, CNN can automatically capture not only object edges but also highlevel features such as shape, color and texture (Buscombe, 2020).In addition, with nonlinear activation functions (e.g.sigmoid) in every neuron, the network is capable of learning the nonlinearity of grain features under diverse environments.When trained with large sets of images, CNN techniques have proven to be a robust tool for object classification and identification (He et al., 2016) even when applied to sub-optimally conditioned images (e.g.non-uniform lighting, noise due to organic debris).
For image segmentation tasks, one of the most widely-used CNN architectures is U-Net (Ronneberger et al., 2015), which was designed to separate individual cells in biomedical images.U-Net has been successfully applied to solve many problems such as multi-organ segmentation (Oktay et al. (2018), detection of lung abnormalities (Kohl et al. (2018) and autonomous driving (Tran and Le (2019).The detection of grains is different with the tasks above in regards to the wide range of grain size classes, diverse grain lithology and the hiding of the grains, the potential of U-Net to detect sediments in diverse fluvial environments has not yet been studied (Mueller, 2019).For field grain size measurements especially in watershed-scale drone surveys, the size of large boulders to be detected can be several magnitudes larger than the size of fine sediments, however, the scale and resolution of input images to U-Net were limited by GPU memory and model complexity.As such, predictive errors arise when splitting the large images into sub-tiles for predicting fine sediments.Meanwhile, inter-granular noise is introduced due to the diverse lithology and weathering, for example, the internal texture for weathered rock tends to be falsely detected as grain interstices.As a result, how can we reduce errors when applying U-Net for grain detection in a diverse range of fluvial environments?How does image resolution and image tile size influence the predictive ability of U-Net?What is the size of the smallest detectable grain unit for U-Net?These questions have yet to been answered.Therefore, it is of great value to develop a U-net-based model for grain size measurement in diverse fluvial environments.
In this paper, we propose a model (GrainID) based on U-Net with an overlap-tile strategy to detect grain size from images in a diverse range of fluvial environments.To achieve our goal, we (i) compiled a large dataset of grain images containing more than 125,000 grains in a diverse range of fluvial environments and trained GrainID with the datasets, (ii) compared the results of GrainID with sieving, manual labeling, Wolman and BASEGRAIN methods, (iii) tested the performance of GrainID for uncalibrated rivers with airborne photos, and (iv) evaluated the influence of vegetation, inter-granular noise, image tile size, and resolution on model performance.

Data
The datasets (84 flume, 118 field photos) cover a wide range of fluvial environments and include a variety of field site and flume experiment images.As shown in Table 1, the datasets were grouped into three subsets according to sediment and channel conditions: (1) Flume channel (84 photos; photo size ~ 0.2m*0.2m);(2) Forested mountain rivers (70 photos; photo size ~ 1m*1m); and (3) Sparsely vegetated large rivers (6 photos; photo size ~ 20m*20m).To train the machine learning model to better distinguish sediments from field environmental elements (e.g.cohesive sands, wood, vegetation and water) and improve the model robustness, we specifically collected 42 field photos primarily consisting of various environmental elements with limited sediment grains in the images .

Flume channel:
The first flume set (SAFL dataset) is collected in a riffle-pool experiment (Fig. 1a; flume size: 2.8m * 55m) carried out in the St. Anthony Falls Laboratory (SAFL) at the University of Minnesota (Singh et al., 2013).The channel bed samples were primarily composed of a sandy-gravel mixture created by adding sand to the clean gravel mixture and turning the bed over.Bed surface samples were then collected and sieved using the Klingeman Sampling protocol (Kondolf, 2000;Klingeman and Emmett, 1982).The second flume set (MCHEL dataset) consists of 33 flume photos taken from a step-pool experiment (flume size: 0.4m*5m) carried out in the Mountain Channel Hydraulic Experimental Laboratory (MCHEL) at The University of British Columbia (Fig. 1b).A non-uniform sediment mixture with a wide grain size distribution between 0.5 and 64 mm (measured by sieving) was used.The sediments in MCHEL are painted in different colors to classify different grainsizes, but the issue of wearing on the grain surface introduces inter-granular noise like the noise introduced by different grain lithologies in the field.
Forested mountain rivers: 70 grain photos (Brayshaw, 2012;Helm et al., 2020) were collected in 18 small forested gravel-bed rivers (basin area < 100 km 2 ; Fig. 1c) in British Columbia, Canada.Visual assessments suggest that a large proportion of the channels were hidden beneath a dense forest canopy composed of both coniferous and deciduous tree species (Fig. 1c), with a channel slope ranging from 0.007 to 0.184, and the sediments cover a wide range of sedimentary, metamorphic, intrusive and extrusive lithologies (Brayshaw, 2012;Hassan et al., 2014).The grain size information in Table 1 for forested rivers was calculated by Brayshaw (2012) using the Digital Gravelometer software proposed in Graham et al. (2005b).
Sparsely vegetated large rivers: Six UAV photos were collected by our research group in two large mountain rivers: Upper Yangtze River (Fig. 1d) and Yaluzangbu River from China.The photos were taken along the riverbank in which they were influenced by the presence of water, waves and cohesive sediments.There was sparse vegetation in the images and the sediments were primarily composed of moderately weathered silicate mineral.
The datasets of 202 images were randomly split into two subsets (Table 1): a training subset (for training and validation) with 136 images and a test subset with 66 images.The training subset was further split into training and validation datasets with a 5-folds cross-validation method during the model training process, and the test subset was a true holdout set to test the model's predictive ability with new images.To evaluate the influence of vegetation and inter-granular noise on model performance, the test subset was further grouped based on the presence of vegetation and inter-granular noise, in which GrainID, BASEGRAIN and the Wolman method were tested for each of the data groups.As shown in Table 1, the tested images with / without vegetation were marked with the superscript v / nv while the tested images with/without inter-granular noise were marked with the superscript i / ni .

Manual labeling
Manual labels were created for all grain images as baseline labels to train and evaluate the methods.Manual labeling was chosen as it is a robust method when applied to diverse fluvial environments due to its basis on human cognitive analysis and it has been widely used as baseline method for grain detection studies (Sime and Ferguson, 2003;Graham et al., 2005a).Figure 2a-2d are the examples of two field images and the corresponding manual labels.Fig. 2a shows a bed with vegetation (Anderson Creek), and Fig. 2c shows a bed without vegetation but with inter-granular noise due to grain lithology.The grains were marked as white pixels isolated from each other and the interstices are marked as black pixels (Fig. 2b, 2d).For grains covered by vegetation, only the exposed part was labeled, and grains with area of 23 pixels were chosen as the smallest grains to be labeled (Detert and Weitbrecht, 2012).As shown in Fig. 2, the images are large enough to capture the grain size distribution even with the presence of vegetation in the image.A total of 128461 grains were marked for the entire dataset of 202 images (67612 in the training datasets, 60849 in the test datasets) by two operators, in which operator-1 created 170 images and operator-2 created 33 images.To ensure the quality of manual labels, a cross-check labeling workflow was used.When an operator finished labelling an image, the labels would be double-checked by the other operator (the inspector), missing grains found by the inspector would be confirmed by both two operators, and only those consensus 'missing grains' would be added to labels.
To explore the consistency in labeling and estimate human related errors, five human operators (including operator 1 and 2) were asked to label a fixed dataset of 12 photos containing 8000+ grains in diverse fluvial environments.The photos are selected from Table 1, in which six are from Forested mountain rivers, three are from the SAFL dataset (Singh et al., 2013), two are from the MCHEL dataset, and one is an airborne photo from Yaluzangbu River.
Boxplots were applied to describe the variation of predicted grain size between the operators.The boxplot displays the fivenumber summary of a set of data including the maximum, third quartile, median, first quartile, and minimum (from top to bottom). Figure 3 shows the boxplot of normalized grain size Dnormalized (Dnormalized = (D -Dmean) / Dstd) for different grain percentiles and different operators, in which D is the predicted grain size for a manual label, Dmean is the mean grain size value of 12 photos chosen for analysis, Dstd is the standard deviation of grain size value of the 12 photos.As shown in Fig. 3, the five operators showed consistent median, first/third quantile and maximum/minimum values for all Dnormalized statistics and all grain percentiles, indicating the consistent predictive ability of the five operators for grains in diverse environments.An exception is D50, in which operators 2 and 5 showed a higher maximum value of Dnormalized than the other three operators.The inconsistency for D50 prediction mainly arises from the predictions for the three photos from the SAFL dataset in which the bed contains a lot of fine grains, and operators 2 and 5 overestimated the D50 by merging fine grains as larger sediments.The analysis suggests that operator 1 produced consistent grain size for all percentiles, but operator 2 may overestimate D50 for images with fine grains.Overall, the manual labels datasets prepared by operators 1 and 2 were consistent with labels created by human operators.

Model framework
A model framework (GrainID) to detect grains from images in diverse fluvial environment was introduced in this section.Fig. 4a shows the framework of the GrainID model working in its 3-step procedure, and Table 2 lists the detailed description of each processing step.For image pre-processing (step 1), we tried three image filters in the Python Image Processing Library: pillow (Clark, 2015): edge enhancement, sigmoid contrast, and detail, in which the Sigmoid contrast filter was chosen for its lower predictive error.Image augmentation (Fig. 4c), a widely-used technique for CNN prediction, allowed the network to learn variances in object location, rotation or deformation without the need to see these transformations in the annotated image corpus.
The CNN prediction for the border region of an image is invalid as the convolution used mirroring context rather than real image information of the border for prediction purposes (Ronneberger et al., 2015).As such, errors are introduced when splitting a large photo into many image tiles for U-net prediction.To solve this problem and achieve seamless prediction, in step 'Image extrapolation-2' and 'Image split', we applied an overlap-tile strategy (Ronneberger et al., 2015).The overlap-tile strategy only utilizes the central parts of an image tile to be used for valid prediction.For example (Fig. 4b), for image tiles (red and blue dash rectangles, 512*512 pixels) used for U-net inputs, only the center parts of U-net outputs (red and blue solid rectangles, 256*256 pixels) were accepted for predictions.To achieve seamless prediction, we created overlapping image tiles for our U-net inputs as dashed red and blue rectangles in Fig. 4b in step 'image split', and the missing context in the border region was extrapolated by mirroring the border region in step 'image extrapolation-2' (shown as shadow region in the image in Fig. 4b).
In step 2, image tiles created by our overlap strategy were then input into U-Net for prediction.The predictions were then recombined into a full image.The final CNN prediction was calculated as a result assembly voted by predictions of the five augmented images, in which the voting rule was that a pixel will be calculated as an interstice if two or more predictions identify an interstice at that pixel so that the model can detect grain interstice and separate grains as much as possible.Four post-processing algorithms were performed in step 3 in which holes inside grains were filled and grains with area < 20 pixels were filtered.To compensate the error of wide interstices due to human labeling, the interstices between grains were narrowed for 2 pixels using an inverse watershed algorithm.Finally, to further separate the merged grains, a watershed algorithm was performed based on grain centroid information.
For a predicted image, the a-axis (major-axis) of a grain was defined as the maximum Euclidean distance between two pixels on the grain boundary, and the b-axis (minor-axis) was calculated as the maximum intercept to the grain along a line perpendicular to the a-axis.Based on the b-axis and grid-by-area method (Kellerhals and Bray, 1971), sediment percentiles D5, D16, D50, D84 and D95 were calculated for the results of manual labeling, GrainID and BASEGRAIN.The sediment percentiles of the Wolman method were calculated based on a grid-by-number method equivalent to the grid-by-area method demonstrated by Kellerhals and Bray (1971).

Manual sieving, BASEGRAIN and Wolman methods
The model proposed in this paper was compared to the manual sieving, Wolman and BASEGRAIN methods.The three methods were considered because they are widely used and accessible.
The manual sieving method was applied to bed samples from the SAFL dataset.Sediment samples were first weighed for a total mass and then sieved through a sieve set (mm): 32; 22.6; 16; 11.3; 8; 5.6; 4; 2.83; 2; 1.4; 1.The sediments of each sieve as well as the fine sediments left in the pan were weighed once again, and the weight percentage of each size fraction were calculated (Singh et al., 2013).
The image-based Wolman method samples 100 grains based on an equidistant grid on the image where the sediment distribution was calculated via a grid-by-number approach that has been applied in many literatures (Kellerhals and Bray, 1971;Hassan et al., 2020).
The BASEGRAIN applies a five-step image processing algorithm to detect grains (Detert and Weitbrecht, 2012): In step ( 1) -(3), the model sequentially applies the (1) double grayscale threshold, (2) morphological bottom-hat transformations and (3) the Canny and the Sobel methods to detect grain interstices.In step (4), an improved watershed algorithm is performed for grain segmentation.In step (5), grains with an area < ~23 pixels are excluded and grain properties (e.g.a-axis, b-axis, orientation) are calculated.During the calibration process, BASEGRAIN includes seven decisive tunable parameters.In image processing step (1), the double grayscale threshold filter includes three key parameters: the size of a median filter (medfiltsiz10) and two gray-thresh values to estimate possible interstices (facgrayhr1 and facgrayhr2).In step (2), the bottom-hat filter includes two decisive parameters: the size (medfiltsiz20) and the criteria (criteriCutL2) of the filter; The remaining two key parameters are for the watershed algorithm, including the minimum grain size (areaCutLfA) and the minimal allowed length of a bridge in watershed algorithm (areaCutWW).
We followed the user guide (Detert and Weitbrecht, 2013;Detert and Weitbrecht, 2020) for model calibration.The seven key parameters were tuned sequentially from image processing step (1) to ( 5), in which medfiltsiz10 was tuned first and areaCutWW was tuned last.The optimal parameters were chosen by adjusting the seven key parameters to get the best visual segmentation.Among the seven tunable parameters, facgraythr1 and criteriCutL2 are the most decisive parameters for processing suboptimal images.For images with non-uniform lighting, inter-granular noise or organic debris, the facgraythr1 was set to lower than 0.4 (default 0.8), and the criteriCutL2 was set to larger than 20 (default 2) to avoid over-split.No manual segmentation was applied to BASEGRAIN output.Please see Detert and Weitbrecht (2012) and Detert and Weitbrecht (2020) for more information on BASEGRAIN implementation.

Model evaluation
The predictive ability of GrainID was compared to sieving, manual labeling, BASEGRAIN and Wolman count for images in the test datasets in section 4. The grain size distribution was calculated for the predicted images of manual labeling, GrainID, BASEGRAIN and Wolman count.The predictive error for grain percentile Di for a tested image is defined as, (2) where Di,baseline and Di,predicted denote the baseline value and predicted value of Di, abs() denote the absolute value.
Mean and median predicting error are used to evaluate the performance of different methods, where Erri, mean and Erri, median are mean value and median value of Erri for photos in the test datasets.Variation of predictions were measured in two ways, of which Vi, 3rd-1st and Vi, max-min denote the variations of third quartilefirst quartile and maximum -minimum for Di.
For the comparison between the sieving and other image-based methods, we applied a projective approach (Fujita et al., 1998) to transform the original images to orthophotographs and relate pixel locations to physical distance (image resolution = 0.45 mm/pixel).The orthophotographs were used as input to the image-based methods for grain size prediction and the predicting result was transferred to physical grain size base on image resolution.

Evaluating the predicting ability of image-based grain sizing methods in diverse fluvial environments
We first compared the predictive ability of four image-based methods to manual sieving as it has been established as the most reliable grain sizing method (section 4.1).Subsequently, we tested the predictive abilities of GrainID in diverse environments based on the entire test dataset (section 4.2).Then, we tested the applicability and robustness of GrainID with a dataset of uncalibrated rivers with airborne photos (section 4.3), in which the dataset is from a different environment (sparsely vegetated large rivers) and different photography method compared to the images in the training dataset (terrestrial photos).Finally, we evaluated the influence of vegetation and inter-granular noise on model performance (section 4.4).In section 4.1, the manual sieving method was used as our baseline measurement.For the analysis in section 4.2, 4.3 and 4.4, manual sieving data was unavailable for the field datasets.The manual labeling was also used as baseline method, as the method is a robust grain sizing method and has also been widely used as a baseline method for grain detection studies (Sime and Ferguson, 2003;Graham et al., 2005a;Ronneberger et al., 2015).

Performance compared to sieving method
The dataset from SAFL (Singh et al., 2013) was compiled to evaluate the performance of image-based methods compared to the manual sieving method.Figure 5a-5d show a sample photo of the flume bed (Fig. 5a), the labels of Manual labeling (Fig. 5b), GrainID (Fig. 5c) and BASEGRAIN (Fig. 5d) predictions.As shown in Fig. 5a, the flume bed contains a lot of fine sediments.GrainID can predict sediment of a wide range of different sizes, whilst BASEGRAIN performs well for large grains but fails to predict fine grains.

Comparison of GrainID, BASEGRAIN and Wolman in diverse environments
The entire test dataset (Table 1) was used to evaluate the performance of GrainID, BASEGRAIN and Wolman methods in diverse fluvial environments.In Fig. 5, we present photos (Fig. 5a, 5e, 5i, 5m), and the predictive results of manual labeling (Fig. 5b, 5f, 5j, 5n), GrainID (Fig. 5c, 5g, 5k, 5o) and BASEGRAIN (Fig. 5d,5h,5l,5p).The photos cover a variety of environments in which Fig. 5a is a flume sandy-gravel bed, Fig. 5e shows a flume bed with a wide grain size range and with inter-granular noise, Fig. 5i shows a forested riverbed with vegetation debris (from a small mountain watershed) and Fig. 5m is a drone photo of a large mountain riverbank.
A rough comparison shows that GrainID successfully predicts grains with inter-granular noise (Fig. 5g), while BASEGRAIN falsely recognizes the inter-granular noise as grain boundaries and splits those grains into smaller ones (Fig. 5h).When there is vegetation, GrainID distinguishes grains from large wood elements (Fig. 5k) while vegetation debris is frequently falsely predicted as grains by BASEGRAIN (Fig. 5l).For Fig. 5m, even with water in the image leading to some predictive error due to limited training for this uncalibrated site, GrainID performs well for all grain size groups (Fig. 5o).With BASEGRAIN, the error due to water was partly overcome ascribe to human expertise during the parameter tuning process, but the model falsely merges some small grains in the images (Fig. 5p).
As shown in Table 3, for small grains D5, D16, D50, GrainID outperforms Wolman and significantly outperforms BASEGRAIN in terms of both Erri, mean and Erri, median.For D84 and D95, GrainID and Wolman show similar performance while BASEGRAIN shows slightly lower performance than the other two methods.As for prediction variation (Fig. 6b), BASEGRAIN shows significantly larger variation Vi, 3rd-1st than the GrainID and Wolman methods for all grain percentiles.
When comparing the change of predictive error versus grain percentiles, Wolman and BASEGRAIN both show larger predictive error for small grains than for large grains.In contrast, GrainID shows similar consistent performance for all grain percentiles.The results indicate that GrainID is a more accurate and robust grain sizing method (especially for small grains) than BASEGRAIN and Wolman methods for diverse fluvial environments.

Performance of GrainID in uncalibrated sites with airborne photos
To test the predictive ability of GrainID in uncalibrated rivers, 13 drone photos were compiled for sparsely vegetated large rivers (Table 1).As shown in Table 3, GrainID shows slightly lower performance for all grain percentiles than its performance in diverse environments, where most of the evaluated images (53 out of 66) were from calibrated sites.Inversely, BASEGRAIN shows slightly higher performance in these conditions in comparison to its performance in diverse environments whilst the predictive error for Wolman in these rivers was similar to its predictive error in diverse environments.Once again, BASEGRAIN and Wolman consistently underestimate grain size (Fig. 6c), and show similar overall performance in terms of Erri, mean and Erri, median.GrainID shows evidently outperform the two methods for all grain percentiles.As for prediction variation (Fig. 6c), GrainID and Wolman show similar variation in terms of Vi, 3rd-1st, and BASEGRAIN shows larger variation than the other two methods.The results suggest GrainID shows better predictive ability than BASEGRAIN and Wolman method even in uncalibrated rivers.

Influence of vegetation and inter-granular noise
The datasets were grouped based on the presence of vegetation and inter-granular noise in the image (Table 1) to evaluate the influence of vegetation and inter-granular noise on the GrainID, BASEGRAIN and Wolman methods.As shown in Table 3, the existence of vegetation and noise have little influence on the performance of GrainID in terms of both Erri, mean and Erri, median for all grain sizes.Conversely, BASEGRAIN shows larger Erri, mean, Erri, median (Table 3) and prediction variation (Fig. 7b) for environments with vegetation and inter-granular noise.For vegetated environments, BASEGRAIN consistently shows larger Erri, median and Vi, 3rd-1st for all Di compared to its performance in environments devoid of vegetation (Fig. 7b).For environments without the presence of inter-granular noise, BASEGRAIN consistently overestimates grain size for all Di.
Interestingly enough however, when there is inter-granular noise, BASEGRAIN consistently underestimates grain size for all Di (Fig. 7b).The performances of Wolman in the four test subsets in this section were similar for all grain percentiles, where there is limited influence from vegetation and inter-granular noise on the performance of the Wolman method (Fig. 7c).Overall, GrainID showed the smallest Erri, median and Vi, 3rd-1st, while BASEGRAIN showed the largest Erri, median and Vi, 3rd-1s for environments with vegetation and inter-granular noise.

Discussion
In this section, we first discussed the error sources of different image-based methods based on the results in section 4.
Subsequently, we explored the influence of image tile size and image resolution on the predictive ability of GrainID by varying the image tile size and image resolution.Then, the truncation area for the smallest detectable grains was discussed and the model efficiency of different image-based methods was compared.Finally, the limitations of GrainID and future improvements and studies were discussed.

Error analysis
The error sources for image-based grain size measurement methods can be divided into five types: (1) the intrinsic error arising from estimating three-dimensional grains with their projection on a two-dimensional image, e.g. the grain vertical axis can't be detected from a image; (2) errors associated with the image-processing algorithm, e.g. the limitation of the interstice-based algorithm as discussed above; (3) errors associated with sub-optimal environments from vegetation, inter-granular noise and sub-optimal lighting, the boundary of those environmental elements could be falsely detected as grain interstice; (4) errors associated with image tile size and image resolution, the smallest detectable grain size is limited by image resolution; and (5) errors associated with grain size distribution, irregular grain shape and photo distortion, a wide grain size distribution could lead to larger error in detecting fine grains.Among the errors above, error type 1 is present for all image-based methods and has been widely discussed in previous literature (Graham et al., 2010) whilst error type 5 is likely to have little influence on the final prediction results (Sime and Ferguson, 2003;Graham et al., 2005b;Detert and Weitbrecht, 2012).In this section, we will discuss the advantages and disadvantages of manual labeling, GrainID, BASEGRAIN and Wolman methods (error type 2), and discuss how vegetation, inter-granular noise, image tile size and image resolution influence the model's predictive performance (error type 3, 4).
Manual labeling, based on the operator's cognitive ability of identifying the grains is the most robust and reliable method when applied to diverse fluvial environments.The influence of image resolution and image tile size on manual labeling are reduced compared to other models.However, the method is extremely time-consuming and laborious.In addition, the method requires a significant degree of expertise from the operator to correctly identify grains.Labeling error variates from operator to operator (Fig. 3).Moreover, based on the experience of all five operators in our study, when the operators get tired after hours of labeling work, the labeling error usually increases (especially for fine grains) with operator fatigue.Manual labeling has been widely used as a baseline method for grain detection studies (Sime and Ferguson, 2003;Graham et al., 2005a;Ronneberger et al., 2015) and was used for the training and evaluation of models in this study.
The Wolman Pebble count is a semi-automatic grain size measurement method as it requires a manual measurement of at least 100 grains and as a result takes more time to perform in comparison to BASEGRAIN and GrainID.Wolman method shows consistent predicting ability in diverse environments.Vegetation, inter-granular noise, sub-optimal lighting and image resolution have similar influence on the method (Table 3) as seen in Manual labeling methods.However, the predicting ability of Wolman method is sensitive to grain size distribution.The Wolman method shows better predicting ability for large grains (D84, D95) than small grains (D5, D16, D50; Table 3), and the method is limited to material > 8 mm when applied in mountain rivers (Kellerhals and Bray, 1971).
BASEGRAIN, as an automatic grain-detecting model, is less time-consuming than manual labeling and Wolman method and is capable of measuring the spatial distribution of grains.The method has been proven in studies to be a reliable grain size measurement method under optimal conditions (no inter-granular noise, no vegetation, and uniform lighting and dryness) (Detert and Weitbrecht, 2020).For flume experiments with regular sandy-gravel beds, BASEGRAIN shows good performance for predicting large grains when compared to sieving results (Fig. 5b).However, as shown in Fig. 4 and Fig. 5, the model performs poorly in detecting very fine grains (usually less than 50 pixels) even in environments with optimal conditions.In addition, the performance of BASEGRAIN in predicting large grains was highly sensitive to environmental factors such as vegetation and inter-granular noise.BASEGRAIN had poor and inconsistent performance for sub-optimal environments (Table 3), while the model also evidently overestimates grains without inter-granular noise while underestimating grains with intergranular noise (Fig. 7b).The reasons are as follows: although BASEGRAIN applied a well-designed algorithm, as introduced in section 3.3, most of the key parameters are calibrated for detecting object interstice (e.g.grayscale threshold filter and bottom-hat filter).When there is vegetation or inter-granular noise in the image, the BASEGRAIN algorithm intrinsically falsely detects the edges of vegetation or inter-granular noise as the edges of grains (Fig. 4).Moreover, as shown in section 4.1, due to the limitations of image resolution, the boundaries of small grains are unclear and detected poorly with simple thresholds.In addition, the model contains 46 adjustable parameters (in which seven are key parameters) such that BASEGRAIN requires a sophisticated parameter tuning process and a high level of expertise from the operator when applied to suboptimal conditions such as field images.
U-Net, with thousands of neurons and nonlinear activation functions (e.g.sigmoid) in every neuron, is capable of learning the nonlinearity of grain features under diverse environments.Through repeated convolution and pooling on the input images, the machine learning model not only uses grain interstice information but also high-level grain features such as shape, color or texture to make their final predictions (Buscombe, 2020).For field application, the interstice-based algorithms tend to falsely detect environmental elements (e.g.organic debris) and over-split the grains.GrainID is capable of overcoming the influence of environmental elements using grain shape, color or texture features to detect grains.As shown in section 4.2, GrainID evidently outperforms the Wolman and BASEGRAIN for all grain percentiles for a hold-out testing dataset from diverse environments, the advantage of GrainID is more significant for small grains than for large grains (Table 3).Meanwhile, the pooling layer and the drop out training strategy improve the robustness of U-net.When trained based on tens of thousands grains, GrainID makes robust prediction for images filmed from a very different environment (uncalibrated rivers) and by a different photography method (airborne photos) compared to the images in the training dataset (section 4.3).In addition, the architecture (Fig. 4) of GrainID overcomes errors arising from image splits (poorer predicting ability of CNN at the border region of an image tile), making it a promising method for large-scale drone surveys.The analysis on drone photos in section 4.3 showed the potential of applying GrainID in large-scale river survey.Similar to other machine learning methods, the predictive ability of GrainID is highly dependent on the quality of training datasets such as the number and diversity of training images.In section 5.5, we discussed the limitations of GrainID and the issue of lack of training in detail.

Influence of image tile size and resolution
The model's predictive ability will be influenced by whether the size of image tiles are too large (under-split; limited by the GPU memory) (Ronneberger et al., 2015) or small (over-split; limited by the size of largest grain to detect).Based on the forested mountain river and sparsely vegetated large river datasets (Table 1) Based on the SAFL dataset in which manual sieving data was collected (Table 1), we explored the influence of image resolution on grain size detection by down-sampling the original image resolution of 0.45 mm/pixel up to 4.5 mm/pixel and comparing the results of down-sampled images to the sieving results.The down-sampling was done using a simple moving average method of increasing window size from 1*1 up to 10*10 (the later controls the spatial resolution) (Chen et al., 2020).As shown in Fig. 8b, the predictive error was quite consistent (Erri,mean ~ 0.10) for resolutions higher than 1.8 mm/pixel, and increased slowly (from 0.10 to 0.96) for resolutions from 1.8 mm/pixel to 3.15 mm/pixel and sharply for resolutions greater than 3.15 mm/pixel.Erri,mean for small grains were more sensitive to the variable of image resolution than large grains.The analysis showed that for a sandy-gravel bed with D50 = 9.5 mm, GrainID can predict all grain percentiles for image resolutions higher than 1.8 mm but failed to predict grain sizes for resolutions lower than 3.15mm/pixel.

Smallest detectable grains
The ability to detect fine grains is limited by image resolution for all image-based grain sizing algorithms.For the smallest detectable grains, Graham et al. (2005aGraham et al. ( , 2005b) ) proposed that the measurement error increases sharply for grains with a baxis smaller than 23 pixels, while Detert and Weitbrecht (2012, 2020) adopted a grain area of 23 pixels as the lowest truncation value (the area of smallest detectable grains) to detect grains for BASEGRAIN.Based on the SAFL dataset, we calculated the mean predictive error Erri, mean (eq.2) of GrainID in comparison to sieving results for different grain area truncation values (areatrunc).As shown in Fig. 9, Err5,mean (the predictive error of D5) is very sensitive to areatrunc, Err5,mean slowly decreases from 0.22 to 0.19 for increasing areatrunc from 1 to 18 pixels, had the lowest value of 0.19 for areatrunc between 18 -25 pixels, and sharply increases to 0.53 for increasing areatrunc from 25 to 100 pixels.The Err16,mean, Err50,mean and Err84,mean (the predictive error of D16, D50, and D84) are less sensitive to areatrunc compared to Err5,mean.However, they have similar three-stage trends to increasing areatrunc, where the error values first decrease with increasing areatrunc (stage-1), then reach a minimum value for an areatrunc period (stage-2), and finally increase for increasing areatrunc (stage-3).In stage-1, the negative correlation between Erri,mean and areatrunc suggests that the smallest detectable grain for GrainID are grains with an area of 18 pixels.In stage 3, the positive correlation between Erri,mean and areatrunc suggests that the areatrunc is too large so that the correct predictions of GrainID were wrongly filtered out.For D95, similar to the previous findings (Graham et al., 2005a), the result shows that Err95,mean are unaffected by areatrunc and remain almost constant for areatrunc from 1 to 100.The analysis above suggests that GrainID performs optimally when the grain area truncation values were equal to 18 -25 pixels.

Model efficiency
To compare the efficiency of GrainID, BASEGRAIN and Wolman methods, we calculated the time consumed by the three models for predicting images from three typical environments (Table 1): ( 1

Limitations and future work
We tested the robustness and applicability of GrainID by applying it to uncalibrated sites (section 4.3).As our model was trained by more than 65,000 grains under diverse mountain environments, the method was overall robust and outperformed BASEGRAIN and Wolman even for uncalibrated sites.However, the test datasets of uncalibrated sites only included 13 images from four sparsely vegetated mountain rivers.As shown in Fig. 5, due to a lack of training some large wood debris, unresolved cohesive sands, flow wave and drone marker boards were falsely identified as grains by the program.As such, the application of GrainID to more diverse fluvial environments would require more training datasets from a greater variety of environments.
However, preparing training datasets necessitates the use of manual labeling and is therefore time-consuming and laborious.
For some images with dense vegetation, even experienced operators may have trouble confidently identifying grains (especially small grains) in the images.Meanwhile, as seen in many other object-based methods, the smallest grain size identifiable by GrainID is limited by image resolution and the grain pattern learned by the model is limited by image tile size.
How the image tile size and resolution influence the predictive ability of GrainID in a greater variety of environments needs further study.In addition, the present model only identifies the presence of sediment grains in the image in which they were further segmented into pixels either as grains or interstices.We hope that with further development the model can be applied to a greater variety of environments and can identify vegetation, cohesive sand or other environmental elements so that the model can learn to further distinguish different environmental elements in the image.
With the development in photography and the GPU computation techniques in the future, a GrainID trained on a sufficiently large dataset can be directly used for many grain size surveys without specifically training for the study region.For model efficiency, based on parallel computing, there are already successful real-time image segmentation techniques in commercial use such as the introduction of self-driving cars and robotic perception (Treml et al., 2016;Siam et al., 2018).With more studies on improving the accuracy and efficiency of GrainID, the model could be applied to detect grains in video recordings of flume experiments which is very important for studies on sediment mobility and transport in gravel-bed rivers.Meanwhile, our study indicates that GrainID has the potential to be used towards predicting drone photos.With more studies on applying GrainID to drone images, the model could be applied to watershed-scale surveys to study the changes and spatial distribution of grain sizes in a watershed.

Conclusion
We proposed an image-based grain detecting model (GrainID) based on convolutional neural networks to detect sediment grain size in diverse fluvial environments.To develop the model, we compiled a dataset of 84 flume and 118 field photos containing more than 115,000 grains covering environments under a wide range of vegetation coverage, grain lithology and lighting conditions.
Tests were performed to compare the predictive ability of GrainID with the performance of manual sieving, manual labeling, BASEGRAIN and Wolman pebble count methods.When using manual sieving as a baseline result, for a flume experiment with sandy-gravel bel, GrainID, with Erri, median = 0.16, 0.16, 0.16, 0.23 and 0.24 for D5, D16, D50, D84, D95, showed a predictive ability comparable to manual labeling (Erri, median = 0.16, 0.10, 0.15, 0.14 and 0.15 respectively) especially for smaller grains.
GrainID and manual labeling largely outperform BASEGRAIN and Wolman method for smaller grains (D5, D16, D50), but show similar performance with BASEGRAIN and the Wolman method for larger grains (D84, D95).
For the entire test dataset based on a diverse range of environments, when using manual labeling as the baseline result, GrainID showed the overall best performance and maintained its advantage even in uncalibrated rivers, whereas BASEGRAIN showed the overall worst performance.The test datasets were grouped based on the presence of vegetation and inter-granular noise in the image (Table 1) to evaluate the influence of vegetation and inter-granular noise on the three image-based methods.The results showed that vegetation and inter-granular noise have little influence on the predictive ability of GrainID and Wolman methods, while BASEGRAIN showed inconsistent predictive ability and larger Erri, median and Vi, 3rd-1st in environments with vegetation and inter-granular noise.
We also studied the influence of image tile size and resolution on the predictive ability of GrainID.For the forested mountain rivers and sparsely vegetated large river datasets, GrainID with an image tile size = 512*512 pixel*pixel had the best performance.For a sandy-gravel bed with D50 = 9.5 mm, the GrainID performed optimally when the image resolution was higher than 1.8 mm/pixel and the grain area truncation values (the area of smallest detectable grains) were equal to 18 -25 pixels.The analysis also indicated that GrainID had a higher working efficiency than the BASEGRAIN and Wolman methods in terms of processing time.The working efficiency of BASEGRAIN is sensitive to environmental conditions, whilst the average efficiency of GrainID only depended on the size of the input images.Conversely, the average time for Wolman method analysis was constant for different environments.The error sources of different methods were also discussed, and the limitations and potential of GrainID for detecting sands and vegetation, as well as real-time prediction and watershed-scale application deserve further studies and development.Based on the overlap tiles strategy, for prediction of image border region, the missing context was extrapolated by mirroring the border region.
1.3 -contrast filter A Sigmod contrast filter in Python Library pillow was applied.

-image augmentation
The input images were augmented by applying 0 o , 90 o and 180 o counter-clockwise (CCW) rotation and horizontal and vertical flip.Step 2: Prediction 2.1 -U-Net prediction All image tiles were then sequentially input into U-Net for prediction.

-recombination
The predicted image tiles were recombined into a full image.

-assemble vote
The five predictions from augmented images vote for the assemble result.

-filling holes
The holes inside grains were filled.

-filter fine grain
Unresolvable grains with size < 20 pixels were deleted.
3.3-narrowing interstice An inverse watershed algorithm was applied.

-watershed algorithm
A watershed algorithm was performed for further separation.
: a CNN architecture for image segmentation U-Net, evolved from CNNs, is specifically designed for image segmentation application.As shown in Fig. 4d, the U-shaped model architecture consists of two major paths: the contracting path (left part) and the expansive path (right part).The contracting path, similar to the typical CNN architecture consists of a sequence of 3*3 convolution layers for feature extraction and 2*2 max pooling layers for down-sampling.In the expansive path, every operation consists of a transposed convolution layer for up-sampling and two subsequent 3*3 convolution layers, where the transposed convolution layer expands the image and maintains the same connectivity as the regular convolution.With this architecture, U-Net can maintain a consistent image size between the output and input and detect specific objects by doing classification on every pixel.The U-net was implemented based on the python library pytorch (Paszke et al., 2019).The cross entropy loss function and the stochastic gradient descent were used for model optimization.Model hyperparameters were tuned based on grid searching optimization and 5-fold random cross validation (Goodfellow et al., 2016).The training speed for U-net is influenced by the number of images in the training datasets, the batch size and the number of training epoch.Given a fixed training datasets, the hyperparameter number of training epoch was tuned first, followed by the learning rate.The optimum batch size depends on GPU memory and we preferred larger batch size for faster training speed when several batch size values result in a similar error during the cross-validation.The optimized model hyperparameters are: (1) number of training epoch = 150; (2) learning rate = 0.005; (3) batch size = 96; and (4) image tile size = 512.The optimum image tile size was determined based on the analysis in section 5.2.
) SAFL datasets: 26 images from flume experiments with optimal conditions(Singh et al., 2013); (2) MCHEL datasets: 12 images from flume experiments with sediment with inter-granular noise(Wang et al., 2021) and (3) 15 images from forested mountain rivers(Brayshaw, 2012).For GrainID, BASEGRAIN and Wolman methods, the rough averaged time of predicting an image are 5s, 46s and 962s for SAFL datasets; 21s, 300s and 1000s for MCHEL datasets and 22s, 600s and 1000s for the forested rivers datasets (processing time of GrainID depends on GPU, our GPU is GTX 1080Ti).GrainID needs the shortest predicting time and the Wolman method requires a significantly longer predicting time as the model necessitated the Manual labelling of the 100 sampled grains.The predicting time of BASEGRAIN varies in different application environments, BASEGRAIN necessitated much more time for parameter tuning for images from MCHEL and forested rivers than images from SAFL.Images from MCHEL and forested rivers contained significantly different types of images and as such to implement the use of BASEGRAIN required an arduous parameter tuning process and a significant level of expertise.However, it is of value to note that the GrainID requires very long time for cross-validation (~ 40 hours for GTX 1080Ti) and model training (~10 hours for GTX 1080Ti), while BASEGRAIN and Wolman count methods don't need model training.As for model efficiency, the advantage of GrainID lies in that (1) because of the robustness of the model, when the machine learning model is trained based on a sufficiently large dataset, the model can be directly used for a new grain size survey without specifically training for the survey region; (2) for predicting a large dataset (thousands of images), the advantage of GrainID in predicting is evident although it needs days of model training.

Figure 1 :
Figure 1: Four typical environments in our datasets: (a) a bed sample collected in SAFL; (b) a step-pool channel bed in MCHEL; (3) Carnation Creek; (d) Upper Yangtze River.

Figure 2 :
Figure 2: Examples of two field photos and the corresponding manual labels: (a) a photo with vegetation from Anderson Creek and (b) the corresponding manual label; (c) a photo without vegetation from Coquitlam River and (d) the corresponding manual label.

Figure 3 :
Figure 3: Boxplot of normalized grain size Dnormalized for percentiles Di for five human labeling operators.

Figure 5 :
Figure 5: Vertical photos and predicting results of Manual labeling, GrainID and BASEGRAIN for a variety of environments: (ad) flume sandy-gravel bed (SAFL dataset); (e-h) flume gravel bed with inter-granular noise (MCHEL dataset); (i-l) location with dense vegetation (Sullivan Creek); (m-p) drone photo of an uncalibrated large river bank (Yangtze River).

Figure 6 :
Figure 6: Performance comparison for different methods.(a) Dpredicted/Dsieving shown for grain percentiles Di of Manual labeling, GrainID, BASEGRAIN and Wolman methods (referred to as G, B and W methods respectively) for a flume sandy-gravel bed; (b) Dpredicted/Dmanual shown for Di of G, B and W methods for the entire datasets; (c) Dpredicted/Dmanual shown for Di of G, B and W methods for uncalibrated rivers with drone photos.

Figure 7 :
Figure 7: Ratio of predicted to baseline grain size value shown for different Di for (a) GrainID, (b) BASEGRAIN and (c) Wolman method in environments with/without vegetation and inter-granular noise.

Figure 9 :
Figure 9: Prediction error versus area truncation value of smallest detectable grains.

1. 5
-image split Input images were split into overlapping image tiles (512*512) as dashed red and blue rectangles in Fig 4b.

Table 2 : Description of each image processing step of GrainID
If the size(e.g.2000*2000)of original input image can't be equally split into multiple 512*512 tiles, the image was extrapolated into 2048*2048 based on mirroring the right and down image border region.