Development of a Distributed Physics‐Informed Deep Learning Hydrological Model for Data‐Scarce Regions

Climate change has exacerbated water stress and water‐related disasters, necessitating more precise streamflow simulations. However, in the majority of global regions, a deficiency of streamflow data constitutes a significant constraint on modeling endeavors. Traditional distributed hydrological models and regionalization approaches have shown suboptimal performance. While current deep learning (DL)‐related models trained on large data sets excel in spatial generalization, the direct applicability of these models in certain regions with unique hydrological processes can be challenging due to the limited representativeness within the training data set. Furthermore, transfer learning DL models pre‐trained on large data sets still necessitate local data for retraining, thereby constraining their applicability. To address these challenges, we present a physics‐informed DL model based on a distributed framework. It involves spatial discretization and the establishment of differentiable hydrological models for discrete sub‐basins, coupled with a differentiable Muskingum method for channel routing. By introducing upstream‐downstream relationships, model errors in sub‐basins propagate through the river network to the watershed outlet, enabling the optimization using limited downstream streamflow data, thereby achieving spatial simulation of ungauged internal sub‐basins. The model, when trained solely on the downstream‐most station, outperforms the distributed hydrological model in streamflow simulation at both the training station and upstream held‐out stations. Additionally, in comparison to transfer learning models, our model requires fewer gauge stations for training, but achieves higher precision in simulating streamflow on spatially held‐out stations, indicating better spatial generalization ability. Consequently, this model offers a novel approach to hydrological simulation in data‐scarce regions, especially those with poor hydrological representativeness.


Introduction
Streamflow prediction is an important field of hydrology, bearing profound implications for water resource management and flood hazard forecasting.Especially within the context of ever-shifting climatic patterns, the increasing water stress and water-related disasters (Hirabayashi et al., 2013;Pokhrel et al., 2021) have amplified the imperative for heightened precision in the realm of streamflow projection.Notably, recent years have witnessed substantial strides in computational capabilities and artificial intelligence (AI), significantly augmenting the accuracy of streamflow prediction models (Shen et al., 2018).Nonetheless, the endeavor to forecast streamflow in data-scarce regions still encounters substantial impediments (Shen & Lawson, 2021).Considering that, apart from a handful of regions such as Western Europe and the United States, numerous areas around the world have a limited number of years of streamflow data available due to sparse observations or policy restrictions (Do et al., 2017;Ma et al., 2021), coupled with frequent water security challenges in these regions (Xu et al., 2018), the streamflow prediction in data-scarce regions under climate change surfaces as a concern of global significance.
For streamflow prediction in data-scarce regions or ungauged basins (PUB), traditional methods predominantly rely upon hydrological models.One approach involves the use of distributed hydrological models, which discretize the region into small sub-basins or hydrological response units and utilize physical equations, such as energy and mass balances, to describe processes such as rainfall infiltration, runoff generation, evapotranspiration, and river routing.Distributed hydrological models encompass thousands of physical parameters that are related to soil properties, vegetation cover, and topographic features.However, due to a lack of observed data, these parameters are often solely calibrated using streamflow data at the watershed outlet, which leads to difficulties in parameterization and issues of equifinality (Beven, 2006).Furthermore, uncertainties stemming from model structure also pose significant challenges for model development (Moges et al., 2021).An alternate perspective involves regionalized hydrological models.These models are established and calibrated within gauged basins (donor basins), and then, leveraging similarity-based or regression-based methods, the model parameters are mapped to ungauged basins (target basins) (He et al., 2011).Nevertheless, this approach is also influenced by uncertainties stemming from the hydrological models built in the donor catchments.In addition, some uncertainties have arisen concerning regionalization methods and procedures, including but not limited to the different definitions of catchment similarities, the selection of catchment attributes, the number of donor catchments, as well as the temporal stability and transferability of hydrologic model parameters (Y.Guo et al., 2021).
In recent years, as the application of AI in the field of hydrology has deepened (Shen & Lawson, 2021;Shen et al., 2018), deep learning (DL) models have exhibited great potential in predicting streamflow within datascarce regions.Benefiting from the synergistic effect of copious data (Fang et al., 2022), regionalized DLrelated models including purely DL models (like Long and Short-Term Memory, or LSTM) (Feng et al., 2021;Kratzert et al., 2019) and physics-informed DL models (such as the differentiable learnable hydrological model) (Feng et al., 2023) have demonstrated excellence in the task of PUB.Similar to conventional regionalization methods, the regionalized DL-related model operates by mapping catchment attributes to model parameters, thereby transferring the mapping information from gauged basins to ungauged basins (Alvarez-Garreton et al., 2018;Kratzert et al., 2019).An outstanding advantage of DL lies in its ability, as compared to traditional hydrological models, to adeptly acquire insights into basin hydrological principles through numerous sites.Moreover, it surpasses the efficiency of traditional regionalization methods in exploiting the inherent correlations within basin attributes, and robustly extending knowledge to previously unseen basins, thereby manifesting heightened performance (Feng et al., 2022).Furthermore, delving deeper, among the two types of DL-related models, the regionalized physics-informed DL models can achieve similar accuracy and better generalization with less data than purely DL models (Shen et al., 2023), and are therefore more suitable for ungauged basins (Feng et al., 2023).This holds particularly true when accounting for non-stationary processes such as climate change, wherein purely data-driven models' accuracy substantially declines.Conversely, physicsinformed DL models exhibit commendable performance for extrapolating spatiotemporal scenarios, as highlighted in recent studies (Wi & Steinschneider, 2022;Zhong et al., 2023).
Despite the exceptional spatiotemporal extrapolation capabilities (i.e., the ability to extrapolate to unseen basins and climate) exhibited by the large-sample-trained regionalized DL models, some ungauged regions do not directly benefit.A major challenge is that if the target catchment is poorly represented in the training data set, such as when the hydrological response is different from that of the source catchment, the performance of the regionalized DL model is likely to be suboptimal in these regions.This was reported by Feng et al. (2021Feng et al. ( , 2023)).They divided the conterminous United States (CONUS) into seven regions based on spatial distribution, trained regionalized LSTM and physics-informed DL models on basins from six of seven regions, and subsequently tested their performance in the held-out region, revealing that the models yielded a relatively low Nash-Sutcliffe Efficiency (NSE) (Nash & Sutcliffe, 1970) in the southern Rocky Mountain (Region 6).This is attributed to the fact that this region is an arid environment dominated by groundwater with a distinct runoff response process compared to the other six regions.Consequently, the model fails to capture the correct hydrological principles for this region.Hence, when training regionalized DL models, it is imperative to ensure that the hydrological response characteristics of the target basins are adequately represented in the training data set.However, this may be a difficult task for some data-scarce regions.For instance, alpine cryospheric regions, characterized by intricate and unique frozen soil, glacier, and snow processes, exhibit limited data availability and are sparsely distributed globally, primarily found in the Asian Tibetan Plateau, European Alps, and certain regions within North American Rocky Mountains and South American Andes.The insufficient global representation of alpine catchments makes the direct application of regionalized DL models challenging.
An approach that indirectly leverages regionalized DL models to data-scarce regions is transfer learning (Pan & Yang, 2010;Sebastian & Lorien, 1998).Transfer learning is a method for transferring knowledge acquired from one task (source domain) to another similar task (target domain).In the context of hydrology, Ma et al. (2021) and Khoshkalam et al. (2023) pre-trained an LSTM model on large data sets like the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) data set (Addor et al., 2017;Newman et al., 2014), and subsequently kept partial model parameters (i.e., weights and biases) freezing (or not freezing at all) while retraining the model using local data.During this process, freezing certain parameters serves to preserve hydrological knowledge acquired from data-rich regions (even when unfrozen, it provides a meaningful initialization with rich hydrological knowledge for model parameters), thus facilitating the training of the model for the data-scarce regions and mitigating the limitations of local observational data (Ma et al., 2021).However, it is worth noting that existing studies about transfer learning primarily focus on prediction in basins that are characterized by having a limited number of years of data (poorly gauged basins) rather than the basins with truly no available data (ungauged basins).This is because transfer learning still requires a small amount of streamflow data from the target catchment for retraining.Although it is possible that transfer learning, after retraining on a certain number of stations, could extrapolate to other unseen stations (i.e., ungauged basins with similar hydroclimatic characteristics, which are not included in the retraining data).To our knowledge, this has not been tested.Furthermore, we suspect that transfer learning requires a high number of retraining stations to successfully extrapolate into ungauged basins with similar hydroclimatic characteristics, which poses a challenge for datascarce regions because there are usually only a limited number of streamflow gauge stations in a given region.Another critical concern is that, like the direct application of regional DL models, transfer learning may also encounter the problem of inadequate representation of the target catchment within the source catchments.This was observed by Yao et al. (2023) when attempting to transfer the CAMELS-trained LSTM model to the datascarce alpine basins on the Qinghai-Tibetan Plateau, which are represented in low proportions in the CAMELS basins.They reported that the effectiveness of the transfer learning model is limited in permafrost-dominated basins.
Beyond transfer learning, we propose that distributed modeling schemes may have the potential to leverage DL to address the streamflow prediction task in data-scarce regions.Here, we adopted a physics-informed DL model due to its superior spatiotemporal generalization capabilities, as previously discussed, in comparison to purely DL models.The distributed modeling scheme involves the spatial discretization of a given data-scarce region and building individual physics-informed DL models for all discrete sub-basins.The prediction errors of each subbasin are propagated downstream via explicit upstream-downstream relationships (i.e., river routing), allowing to constrain and optimize all upstream sub-basin models with only a small amount of data from downstream streamflow.Therefore, in comparison to the transfer learning model, this approach may harbor the potential to attain superior prediction for the ungauged sub-basins and circumvents the problem that transfer learning models may perform poorly when applied to under-represented target basins.Moreover, compared with the traditional distributed hydrological model, this scheme harnesses DL to enhance modeling and diminish structure uncertainties of hydrological models (Feng et al., 2022).Consequently, we expect it to offer a more adaptable deployment solution for data-scarce regions.
Therefore, the objective of this study is to develop a distributed modeling scheme based on physics-informed DL to address the challenge of streamflow prediction in data-scarce regions.Across the globe, watersheds contend with varying degrees of data scarcity.Some watersheds may possess only one discharge gauge station at the outlet, while others may sporadically feature two or three additional stations dispersed upstream.With this in mind, we first assessed the influence of varying numbers of training sites on the distributed physics-informed DL model.Subsequently, both a distributed hydrological model and a CAMELS-based transfer learning model served as benchmark models to validate the model performance.

Study Area and Data
In this study, we chose a typical alpine catchment, the Source Region of the Yellow River (SRYR), as a case to develop a distributed physics-informed DL model.The alpine regions represented by SRYR are globally significant water sources (Immerzeel et al., 2020).However, due to their remote geographical location and harsh climate, they also stand as some of the most challenging areas worldwide in terms of data acquisition (Weber et al., 2021).Furthermore, the rugged terrain of these regions engenders substantial hydrological spatial heterogeneity (Riseborough et al., 2008;Wu et al., 2010), necessitating refined spatial simulations of internal hydrological variables.Considering that alpine regions are dominated by cryospheric processes such as permafrost, glaciers, and snow, their hydrological dynamics are both unique and intricate (Beniston et al., 2018;Walvoord & Kurylyk, 2016).The presence of knowledge gaps often results in considerable uncertainty within traditional hydrological models (Beniston et al., 2018;H. Gao et al., 2021), prompting the need for the aid of DL techniques.However, the proportion of alpine watersheds within global gauged watersheds is typically quite low (Zhong et al., 2023), rendering it a challenge to regionalize large-sample-based DL models.Thus, the development of a distributed physics-informed DL model for alpine watersheds holds profound practical significance, given the confluence of these unique challenges.Situated in the northeastern part of the Qinghai-Tibetan Plateau (QTP) (Figure 1), the SRYR rests at elevations ranging from 2,656 to 6,252 m, featuring a continental alpine climate.The region exhibits an annual mean temperature of 2°C and an annual mean precipitation of 507 mm.Over the period from 1960 to 2019, climatic shifts have been distinctly observed, with a warming rate of 0.33°C per decade (Zhong et al., 2023).While the total drainage area of the SRYR spans 124,000 km 2 , merely five hydrological stations are dispersed within the basin, highlighting the issue of data scarcity.Hydrological processes within the watershed are significantly influenced by cryosphere dynamics.The upper portion above the Jimai (JIM) station is dominated by permafrost, transitioning to seasonally frozen soil in downstream regions (Cao et al., 2023).Additionally, the northern part of the catchment harbors the Ányêmaqên Glacier (W.Guo et al., 2014).There are two large lakes, Zhaling Lake and Erling Lake, in the northwest of the catchment.
The shuttle radar topography mission (SRTM) digital elevation model (DEM) with a resolution of 90 m (Jarvis, 2008) was used to delineate the watershed and subsequently subdivide it into discrete sub-basins.Daily forcing data from 1960 to 2019 used in this study includes air temperature (tas) and precipitation (pr) collected from 27 meteorological stations, and the potential evapotranspiration (pet) was calculated using the FAO Penman-Monteith equation (Allen et al., 1998).All the meteorological data were downloaded from the National Meteorological Information Center of China Meteorological Administration (http://data.cma.cn) and then spatially interpolated with an angular distance weighting method with elevation corrections (D.Yang et al., 2004) to calculate basin-averaged values for each sub-basin.
Static catchment attributes include topography, soil, climate, and vegetation properties (Table S1 in Supporting Information S1).In cases where data availability allowed, efforts were made to closely match the static catchment attributes with those in the CAMELS data set, with the exception that the data sources employed in this study differed from those utilized in CAMELS.Among these, soil data was obtained from the Land-Atmosphere Interaction Research Group at Sun Yat-sen University (Dai, Wei, et al., 2019;Dai, Xin, et al., 2019;Shangguan et al., 2013Shangguan et al., , 2017)).Vegetation attributes originated from sources including ESA WorldCover 10 m 2021 v200 (Zanaga et al., 2022) and AVHRR GIMMS LAI3g (Zhu et al., 2013).All attributes were calculated averages within the sub-basin.Additionally, to account for watershed heterogeneity, for topography attributes (excluding area) and soil attributes, the minimum, maximum, and standard deviation of each variable within the catchments were also calculated.
Daily streamflow data from Jimai (JIM), Metang (MET), Maqu (MAQ), Jungong (JUG), and Tangnaihai (TNH) were used to train and validate the model, which was obtained from the Information Center of the Chinese Ministry of Water Resources.Among these stations, streamflow data for JIM, MAQ, and TNH spans the period from 1960 to 2019, with data for the years 1983 and 1990 missing at JIM.Data from MET and JUG stations covers the years from 2000 to 2019, but the MET station solely records observational data for the months from May to October.Beyond streamflow, we also utilize GLDAS NOAH v2.1 evapotranspiration (ET) data (with a monthly temporal resolution and a spatial resolution of 0.25°) (Beaudoing & Rodell, 2020) to supplement the evaluation of model performance across all sub-basins.As per prior research, this product has demonstrated the best overall performance within the SRYR (Li et al., 2019).Although the GLDAS NOAH v2.1 ET has a slightly higher accuracy in the study area than other products, it is important to note that it still does not represent ground truth and inherently carries significant uncertainties.Specifically, it tends to overestimate ET compared to water balance-based estimates, with a root-mean-square-error (RMSE) of approximately 0.6 mm/day (Li et al., 2019).
The CAMELS data set (Addor et al., 2017;Newman et al., 2014) was used to pre-train a regionalized physicsinformed DL model and subsequently transferred it to the SRYR.The CAMELS data set contains the forcing, static catchment attributes, and streamflow for 671 watersheds across the contiguous United States (CONUS), among which the forcing data is the Daily Surface Weather Data on a 1-km grid for North America (Daymet) (Thornton et al., 2020).

Development of a Distributed Physics-Informed DL Model
In this study, we adopted the differentiable parameter learning (dPL) framework proposed by Tsai et al. (2021) and Feng et al. (2022) to develop a distributed physics-informed DL model.dPL refers to a methodology that involves the symbiotic modeling of physics and neural networks (NN), employing differentiable programming to track gradients, in pursuit of optimal parameter sets for process-based geoscientific models (Feng et al., 2022).dPL embodies a form of DL model constrained by scientific understanding, endowing DL with enhanced interpretability, generalizability, and extrapolative capabilities with less data.Consequently, it is suitable for datascarce regions (Shen et al., 2023).
To develop a distributed dPL model, we utilized the SRTM DEM to extract the river network, by which the catchment was divided into 69 sub-basins (Figure 2b), ranging from 181 to 5,114 km 2 with a mean area of 1,792 km 2 .For each sub-basin, we apply the dPL framework and the exponential bucket hydrological model (EXP-HYDRO) (Patil & Stieglitz, 2014) to calculate the runoff generation process (Figure 2a).The EXP-HYDRO model consists of two state buckets (snow bucket and soil water bucket) and six physical parameters that chiefly account for processes including precipitation, snow accumulation and melt, soil water, subsurface flow, and surface flow.The small glaciers and lakes were not considered here since they have little effect on the streamflow in the SRYR (J.Yang et al., 2003).To enhance the model's applicability in alpine catchments, several refinements were implemented in the EXP-HYDRO (refer to Table S2 in Supporting Information S1): 1. Given the significance of soil freeze-thaw in alpine domains, we introduced a parameter (α) within the soil water bucket to conceptualize the ice content S s s ) in the soil water.When soil is freezing, only the unfrozen portion of soil water S l s ) contributes to streamflow and ET calculations.Compared with the unfrozen state, we assume that the water-holding capacity of frozen soil diminishes, with the maximum storage decreases from The sub-basins were coded in the routing order.The lowest sub-basin of the mainstream was coded as 1.Along the upper mainstream, the number of sub-basins coded was added by 1 (e.g., 2, 3, 4…).The lowest subbasin of the tributary was coded x.1, and up the tributary, the code was added 0.1 (e.g., x.2, x.3…).For example, the upper reach of the primary sub-basin 1 is the primary sub-basin 2, which can be subdivided into two secondary sub-basins 2.1 and 2.2 along the tributary.When calculating the flow routing, the secondary subbasins were considered first and then the primary sub-basins.For example, the runoff from sub-basin 2.2 first flows along the tributary to the outlet of sub-basin 2.1, then superimposes the runoff generated in sub-basin 2.1 and the flow routed from sub-basin 3.1, and finally flows along the mainstream to the outlet of sub-basin 1.
S max to S max S s s .We caution that EXP-HYDRPO is a conceptual model, so the frozen water content here does not correspond to actual frozen soil depth.2. Inspired by the design of Feng et al. (2022), an additional parameter (β) was added for the ET calculation to account for vegetation effects, as the original formulation solely considered soil moisture.3.As in Mizukami et al. (2017), we integrated a hillslope routing module into the surface flow (Q f ) component, utilizing convolution calculations with a gamma distribution-based unit hydrograph to derive the final surface runoff (Q * f ) at the outlet of each sub-basin.The module encompasses a shape parameter (a) and a timescale parameter (b).
In the aforementioned parameters, as the state of soil freeze-thaw undergoes dynamic fluctuations, we designated the parameter α as time dynamic.Considering that soil freezing will reduce soil hydraulic conductivity (Agnihotri et al., 2023) and therefore influence the velocity of soil water outflow, the parameter F (which controls the rate of decline in flow from soil bucket) was also considered as dynamic.Furthermore, mirroring the approach of Feng et al. (2022), for the ET correction parameter β, we set it to be dynamic to mitigate the influence of vegetation phenology on ET.As for the remaining parameters, they were all considered static.All these parameters were determined by the dPL framework, wherein the inputs encompass forcing and catchment static attributes.Within this framework, the dynamic parameter learning network is constituted by LSTM, yielding different parameter values at each time step.In the realm of static parameter learning, a onedimensional convolutional (Conv1D) (Fukushima & Miyake, 1982) encoder is initially deployed to extract features from meteorological data, which are subsequently concatenated with catchment attributes.This encoder serves to complement and alleviate uncertainties inherent in the selection of basin attributes (Botterill & McMillan, 2023).The concatenated features then proceed as input to a fully connected neural network (FCNN), thereby ascertaining static parameter values.
After calculating the runoff generation for sub-basins, the runoff from each sub-basin will converge sequentially from upstream to downstream along the river network to the outlet of the watershed.In this study, the computation of river routing was performed utilizing the dPL + Muskingum routing method, as shown in Figure 2b.The Muskingum routing method is a variation of the kinematic wave model (Te Chow et al., 1988), widely employed in the hydrological models for the calculation of river routing, such as SWAT (Neitsch et al., 2011) and HEC-HMS (Scharffenberg & Fleming, 2006).According to the Muskingum method, for a given river segment, the outflow at the end of the timestep (q out,2 ) is determined by the inflow at the end of the timestep (q in,2 ) , as well as the outflow and inflow at the start of the timestep (q in,1 and q out,1 ).The calculation of q out,2 involves three coefficients (C 1 , C 2 , and C 3 ), which could be ascertained based on the parameters of storage time constant (K), weighting factor (X), and time step (Δt).The specific derivation process was elaborated in Text S1 in Supporting Information S1.
In conventional approaches (such as that in the SWAT model), X is defined by the user, and K is usually determined by the Cunge equation (Cunge, 1969), involving Manning's coefficient, slope, length of the river segment, hydraulic radius, and flow velocity.Consequently, the parameterization entails considerable uncertainty and high data requirements.Considering the paucity of data in data-scarce regions, we employed the dPL framework to automatically determine K and X for each river segment.The dPL framework is constructed by an FCNN layer, wherein the input encompasses the silt, clay, and sand fractions of the channel, in addition to the channel length, channel slope, and river width.Among these attributes, the initial five could be computed using soil data and DEM.As for the river width, given the absence of measurements, it was approximated herein using the upstream catchment area, as prior studies reported a substantial correlation between river width and the area of the upstream catchment (Lu et al., 1999).

Evaluation of the Developed Distributed dPL Models When Trained With Different Number of Gauge Stations
This experiment was designed to explore the performance of the newly developed distributed dPL (abbreviated as dis-dPL) models trained with different numbers of gauge stations.Three different models were built here.The first model was solely trained using data from the lowest downstream station TNH.In addition to TNH, the second model considered the midstream station MAQ for training.Further, the third model was trained with the Water Resources Research 10.1029/2023WR036333 ZHONG ET AL. downstream station TNH, the midstream station MAQ, and the upstream station JIM simultaneously.Through a comparative analysis of these three models, we could discern the influence of incorporating upstream station data on the model's performance.

Compare the Newly Developed Distributed dPL Model With Two Widely Used Approaches in Data-Scarce Regions
Physics-based distributed hydrological models and transfer learning models are two commonly used methods for hydrological modeling in data-scarce regions.They represent two different approaches to address the PUB task: one involves distributed spatial discretization and modeling to obtain runoff simulations within internal ungauged sub-basins, while the other relies on regionalization of the lumped model, establishing mappings from basin attributes to model parameters, to simulate runoff in ungauged sub-basins.Therefore, they were selected as benchmarks for comparative analysis with the newly developed distributed physics-informed DL model, which will aid in further validating the feasibility of our approach.
The first benchmark model we chose here was the distributed geomorphology-based eco-hydrological model (GBEHM) (B.Gao et al., 2016).The GBEHM is a widely applied physically based distributed model for alpine regions and has been extensively validated in the SRYR (Qin et al., 2017;J. Yang et al., 2023), Yangtze River source region (Shi et al., 2020), and upper reaches of the Heihe River (B.Gao et al., 2016).The GBEHM model was built with a spatial resolution of 3 km 2 and a temporal resolution of 1 hr.Daily meteorological data spanning from 1960 to 2019 were used to force the model.For more model settings, please refer to J. Yang et al. (2023).
Another benchmark model is the CAMELS-based transfer learning dPL model.The transfer learning model maintains an identical model structure to that of the distributed model, namely the dPL + EXP-HYDRO, ensuring a fair comparison.The only distinction lies in its lack of employment of a routing module, maintaining a lumped structure akin to existing studies (Khoshkalam et al., 2023;Ma et al., 2021).We aimed to assess the spatial generalization capabilities (i.e., the ability to predict in ungauged stations that are not included in the retraining stations) of the lumped transfer learning (abbreviated as lum-TL) model within a data-scarce catchment.To achieve this, we utilized data from three discharge gauge stations (TNH, MAQ, and JIM) for model retraining and subsequently evaluated their performance in simulating streamflow at two spatial held-out stations (MET and JUG), as well as the ET across all sub-basins.We caution that the utilization of only three stations for retraining the transfer learning model is intended to accurately reflect the data constraints encountered in data-scarce watersheds, aiming to assess the performance of the transfer learning model under conditions of data scarcity.

Assess Whether Distributed Transfer Learning Based on Large Samples Can Improve Streamflow Modeling in Data-Scarce Regions With Poor Hydrological Representativeness
In this experiment, we further integrate the distributed framework with transfer learning to establish the models.The distributed transfer learning (abbreviated as dis-TL) model's structure is identical to the distributed dPL model described in Section 2.3.1.The difference is that we used the CAMELS pre-trained dPL + EXP-HYDRO model to provide the initial parameters for the runoff generation module in each sub-basin.Then, the whole model (including runoff generation and river routing modules) was retrained using the data from the SRYR.
We similarly trained three separate models with different numbers of hydrological stations and compared them with the three dis-dPL models from Section 2.3.1.The purpose was to determine whether hydrological insights gained from large data sets via pre-training could enhance streamflow simulation in data-scarce alpine catchments that have distinct processes and poor representativeness.Besides, the dis-TL model was further compared with the lum-TL model built in Section 2.3.2, to highlight the effectiveness of incorporating river channel routing by the distributed framework.
The aforementioned three experiments involve three distributed dPL models, one lumped transfer learning dPL model, one GBEHM model, and three distributed transfer learning dPL models.For brevity, we abbreviated the seven DL-related models as A-B-C (see Table 1), where A is either "dis" (distributed) or "lum" (lumped), B is dPL or TL, and C is the number 1, 2, or 3, indicating the number of hydrological stations used for model training.Specifically, 1 denotes training the model solely with streamflow data from the downstream station TNH, 2 indicates the utilization of data from both downstream station TNH and midstream station MAQ, while 3 involves data from downstream station TNH, midstream station MAQ, and upstream station JIM.

Water Resources Research
10.1029/2023WR036333 ZHONG ET AL.

Model Training Details and Evaluation Metrics
The three dis-dPL models were trained on a daily scale with the data from 1 January 1960 to 31 December 1989.The period from 1 January 1990 to 31 December 1999 was used as the validation period to guide hyperparameter tuning, and the testing period was 1 January 2000 to 31 December 2019.The lum-TL-3 model and runoff generation module of three dis-TL models were initially pre-trained on the CAMELS data set from 1980 to 1995.Subsequently, the model weights and bias were retrained using the daily streamflow data from the SRYR.The training, validation, and testing periods were kept the same as the dis-dPL models.Within this study, we conducted separate experiments (refer to Text S2 in Supporting Information S1) to assess the performance of the lumped and distributed transfer learning models retrained with model parameters partially freezing or not freezing at all, as in Ma et al. (2021).The results showed that the unfrozen configuration yielded the optimal performance.Therefore, in the subsequent sections, the transfer learning models specifically denote this variant.
The model hyperparameters include the learning rate, the number of Conv1D layers, the number and size of convolution kernel per layer, and the hidden layer size of the LSTM layer and FCNN layer.We tried different parameter combinations to ensure that each model reached the optimum, and the final selected hyperparameters are shown in Table S3 in Supporting Information S1.Preliminary experiments showed a sequence length of 4 years and an additional warm-up period of 1 year yielded the optimum model (refer to Figure S2 in Supporting Information S1).After experiments (Figure S3 in Supporting Information S1), we set the number of multicomponents as 16 to consider the heterogeneity within sub-basins, as in Feng et al. (2022).During the hyperparameter tuning, gradient clipping and early stopping methods were used to prevent gradient explosion and reduce overfitting, respectively.To obtain more robust results, each model was trained with six different random seeds, and the ensemble mean was calculated for evaluation.As for the loss function, referring to previous studies (Kratzert et al., 2019), we chose the NSE loss weighting at different stations.For GBEHM, the model calibration was performed using the daily discharge data from TNH during , which is consistent with the dis-dPL-1 model, thus allowing a fair comparison.
The model evaluation focused on the precision of streamflow simulation at the five stations during the test period.For each model, an assessment was conducted concerning its performance on both the temporal test (stations that were used for training) and spatial test (spatial held-out stations that were not used for training), respectively.For example, for the dis-dPL-3 model, the temporal test stations encompassed TNH, MAQ, and JIM, whereas the spatial testing stations consisted of MET and JUG.Compared with the temporal test stations, the performance on spatial test stations more effectively reflects the model's ability to capture the spatial pattern of streamflow.The streamflow evaluation involved two metrics: NSE and RMSE, calculated as follows: where n is the sequence length, y o is the observed streamflow, and y s is the simulated streamflow.The overline denotes the mean value.The NSE ranges from ∞ to 1, with 1 indicating the perfect performance.Traditionally, an NSE greater than 0.65 is considered a good performance (Moriasi et al., 2007).In addition to streamflow, the time series of simulated ET in 69 sub-basins were also evaluated against the GLDAS NOAH ET, using RMSE and Pearson correlation coefficient (R).

Performance of the Distributed dPL Models Trained With Different Numbers of Hydrological Stations
The streamflow simulation results during the test period under varying amounts of data for model training are illustrated in Figures 3 and 4. Notably, the dis-dPL-1 model exhibits commendable performance even when trained using only data from a single station (TNH): the NSE exceeds 0.9 at TNH (Figure 3a), and satisfactory results are achieved at the four previously unseen upstream stations (i.e., JUG, MAQ, MET, and JIM) .This underscores the efficacy of the distributed framework: training the model solely with data from the downstream-most stations can effectively constrain the model's runoff generation and river channel routing in various sub-basins, resulting in reliable spatial streamflow simulations.This is exemplified by the model's outstanding performance at four spatially held-out stations.
Another discovery is that with the inclusion of the midstream station MAQ, the performance of the dis-dPL-2 model at the most downstream station TNH only slightly deteriorates (Figures 3a and 4a), while predictions for upstream stations (except JIM) are improved (Figures 3b-3d and 4b-4d).When the uppermost station JIM is further included, the dis-dPL-3 model shows continuous improvement in predictions for upstream stations while maintaining good performance at downstream stations: the NSE for MET and JIM relative to the dis-dPL-2 model increased by 0.07 and 0.128 (Figures 3d and 3e and 4d and 4e), respectively.
The hydrograph results also show that the dis-dPL-1 model performs well in simulating peak flow and peak timing at downstream stations TNH and JUG, with the annual mean hydrograph curves closely matching observed data (Figures 5a and 5b).However, at upstream stations such as MET and JIM (Figures 5d and 5e), the peak timing appears to be earlier than the observation.Nevertheless, with the inclusion of upstream stations for training, this erroneous advancement is effectively corrected by the dis-dPL-2 and dis-dPL-3 models, and the simulated hydrographs better resemble the observed ones.
This indicates that adding stations can effectively enhance the predictive capability of the distributed dPL model, especially the inclusion of upstream stations can effectively improve simulations of upstream basins without compromising much downstream performance.Additionally, a meaningful point is that the distributed dPL model demonstrates relatively low data requirements, as in this study, even training on just one station at the most downstream location yields decent performance.

Comparison of Distributed dPL Models With the Two Widely Used Approaches in Data-Scarce Regions
The performance of the lum-TL-3 and GBEHM models in simulating streamflow during the test period is shown in Figures 3 and 4. The dis-dPL-1 and GBEHM models were employed for comparative analysis to explore the merits and demerits of our newly developed model in comparison to traditional distributed hydrological models.The two models' performances differ significantly: not only does the dis-dPL-1 model exhibit significantly higher NSE and significantly lower RMSE at the temporal test station (TNH) compared to GBEHM (Figures 3a and 4a), but it also demonstrates substantial advantages at upstream spatial test stations (JUG, MAQ, MET, and JIM) .This underscores the considerable advantages of the dis-dPL model in spatiotemporal generalization when compared to traditional models.
Besides, the comparison between the dis-dPL-3 and lum-TL-3 models provides insights into the applicability of our model over the lumped transfer learning dPL model in ungauged catchments.Under the same data training conditions, the dis-dPL-3 model exhibits significantly better performance than the lum-TL-3 at both the temporal test stations (TNH, MAQ, and JIM) (Figures 3a,3c,3e and 4a,4c,4e) and spatial test stations (MET and JUG) (Figures 3b,3d and 4b,4d).Especially in the midstream and upstream stations where the simulation is more difficult, this performance advantage of the dis-dPL-3 model is more pronounced, with an NSE enhancement of 0.057 at MAQ, 0.105 at MET, and 0.087 at JIM.Even the dis-dPL-1 model, trained with data from only the downstream-most station, performs slightly better than lum-TL-3 at the two spatial test stations.This highlights the dis-dPL model's ability to attain comparable temporal extrapolation and better spatial generalization in streamflow simulations while demanding less data for training.

Comparison of the Distributed TL Models With the Lumped TL Model and Distributed dPL Models
In contrast to the lum-TL-3 model, the dis-TL-3 model was retrained with the same data but achieved significant improvements in NSE at upstream stations while maintaining comparable performance at downstream stations (refer to Figures 3 and 4).This reveals the potential of combining transfer learning and the routing module of distributed frameworks to enhance simulations.Besides, the dis-TL models' performance in simulating streamflow exhibits a similar pattern with dis-dPL models: with the addition of upstream hydrological stations for retraining, there is a slight deterioration in NSE at the downstream station TNH, while notable improvements are observed for upstream stations, particularly at MET and JIM stations.Similarly, the overall performance of the dis-TL-1 model, retrained with only one local downstream station, is acceptable, reaffirming the findings in Section 3.1 and the great value of the incorporation of the distributed framework and dPL.
Nonetheless, comparing the three dis-TL models with the dis-dPL models trained using the same local hydrological stations, certain performance disadvantages of the dis-TL models could be observed.As described in Section 2.4, the parameters of the dis-TL models adopted in this study were entirely unfrozen during the retraining process.Based on this, one intuitive understanding may be that even if the hydrological knowledge obtained from pre-training does not apply well to the local region, the model should have sufficient flexibility to adjust and relearn during retraining.However, our findings are "counterintuitive" in a way.The results described in Figure S2 in Supporting Information S1 already shed some light on the problem.It is evident that when we fully or partially freeze the parameters of LSTM and Conv1D layers (i.e., corresponding to the dis-TL-3a, dis-TL-3b, dis-TL-3c models), their performance is essentially equivalent to that of the dis-TL-3d model (the variant we ultimately chose).This suggests that even though transfer learning models are given ample learning space to adapt to local hydrological processes if the knowledge obtained from the pre-training is quite different from that of the local hydrological process, the model may be easily misled by wrong experience and fall into the cognitive rigidity, resulting in the local optimal trap for all four models.
In Figure 6, we further analyzed the spatial distribution of runoff (the sum of surface and subsurface flow that has not entered river channel routing) simulated by the four variants of the dis-TL-3 models and found that regardless of parameter freezing during retraining, the spatial distribution of runoff simulated by the four variants remained consistent, differing significantly from dis-dPL-3.This further validates our previous conjecture that distributed transfer learning dPL models might be trapped in local optima, failing to adapt to local hydrological processes, thereby resulting in slightly inferior performance compared to local models.

Further Validation of the dis-dPL Models by ET Comparison and State Variables Interpretation
The temporal evaluation of the eight models' simulation for monthly ET across all sub-basins during the test period is shown in Figure 7. Overall, the three dis-dPL models exhibit relatively comparable performance in simulating ET.At the monthly scale, the median R between simulated ET and GLDAS NOAH ET is approximately 0. with an RMSE of approximately 0.6 mm/d, we consider the ET simulations produced by the dis-dPL models generally fall within an acceptable range.
Compared to other DL-related models, the dis-dPL models remain competitive in capturing the temporal variation of ET.While median R and median RMSE of ET simulated by the dis-TL model are comparable to those of the dis-dPL models, the longer whiskers of the box plots indicate slightly greater standard deviation and inferior robustness.On the other hand, the lum-TL-3 model, in terms of both median metrics and standard deviation, notably lags behind the dis-dPL models.Text S3 in Supporting Information S1 further compares the performance of all models in simulating the spatial distribution of ET and snow.Overall, the dis-dPL-3 model demonstrates a relatively good match with the spatial patterns of remotely sensed ET and snow and is competitive compared to the other DL-related models.The temporal variations of ET simulated by the GBEHM model exhibit the closest  resemblance to GLDAS NOAH ET, with a median R of 0.954 and a median RMSE of 0.524 mm/d.However, it performs the worst in simulating the spatial patterns of ET and snow depth.
We further visualized the spatiotemporal variations of soil freeze-thaw (Figures S7 and S8 in Supporting Information S1).It is worth mentioning that, despite limited expert guidance on the freezing-thawing process, the dis-dPL-3 model successfully captures its fundamental spatiotemporal variations.Figure S7 in Supporting Information S1 illustrates that the spatial distribution of frozen and unfrozen water content simulated by the model closely approximates the distribution of frozen soil types within the SRYR (Figure 1).The phenomenon displayed in Figure S8 in Supporting Information S1, where the unfrozen water content in the two typical basins remains unchanged and then rises sharply during the thawing season, is also consistent with the "low runoff in early thawing (LRET)" phenomenon revealed by H. Gao et al. (2022).Detailed analysis can be found in Text S4 in Supporting Information S1.

The Potential of Distributed Physics-Informed DL Models for the PUB Task
Our work demonstrates the distributed dPL models' superior spatial generalization for the PUB task.The model, trained only with data from most downstream stations, achieves satisfactory performance even at upstream spatial test stations.Furthermore, as additional upstream stations are included for training, the model enhances its streamflow simulation at upstream stations with a minor performance cost at downstream stations, which aligns with findings from Bindas et al. (2024).In contrast to Bindas et al. (2024), we further enable collaborative adaptation between the runoff generation module and the routing module during training, which is more suitable for data-scarce regions.This is because data scarcity makes it challenging to obtain precise and seamless runoff predictions for sub-basins in advance.Besides, by combining the runoff generation module and river routing module, the model errors in individual sub-basins can propagate downstream through the river network, allowing optimization of the models for all sub-basins with training data from only a few downstream gauge stations.
We note that, despite the commendable performance at the three temporal test stations, the lumped transfer learning model's simulation of the streamflow at the spatial test station MET and ET in ungauged sub-basins is not optimal.This indicates that the spatial generalization capacity of transfer learning is indeed limited when retrained with three discharge gauge stations.Given that the lumped transfer learning model extrapolates to unseen basins based on the mapping from model inputs to parameters learned from large samples (Kratzert et al., 2019), this limitation may be partly attributed to the fact that the mapping from the forcing and static attributes to model parameters established during pre-training may not apply well in the target catchment due to significant differences in basin characteristics and hydrological processes.This mapping needs to be rebuilt during the retraining process.However, the limited data availability in data-scarce regions for retraining may result in a less robust reconstruction of the mapping relationship, thereby affecting the model's spatial generalization ability.While we speculate that the inclusion of more stations in retraining could contribute to the enhancement of the spatial generalization capability, we caution that this may go beyond the available data reality for most data-scarce regions.
Further integration of the distributed framework based on transfer learning partially alleviates the aforementioned issues.With the integration of the routing module, the distributed transfer learning model displays notably better spatial generalization ability than the lumped ones, which manifests in the enhanced simulation of streamflow at spatial testing stations and improved evapotranspiration estimates in sub-basins.This enhancement is mainly because the distributed modeling framework explicitly embeds upstream and downstream relationships to characterize spatial dependencies between ungauged sub-basins, requiring less data than the lumped transfer learning model that implicitly learns basin dependencies from large amounts of data.Another intriguing finding was that the distributed transfer learning dPL models pre-trained on large data sets did not outperform, and even somewhat underperform, the local distributed dPL models.This could be attributed to the poor hydrological representativeness of alpine basins in global data sets, rendering the hydrological knowledge obtained from the pre-training potentially inapplicable.Similar findings were reported by Yao et al. (2023), who transferred LSTM pre-trained on the CAMELS to the SRYR, yielding a daily NSE of only 0.62 at the JIM station, much lower than the NSE of 0.73 obtained from a local DL model in the previous study (Zhong et al., 2023).This further reminds us that when transfer learning models are applied to ungauged basins, their performance is subject to the dual constraints of insufficient retraining site quantity and poor hydrological representativeness. Water On the other hand, the distributed physics-informed DL model is also superior to the traditional distributed hydrological model GBEHM in terms of the spatiotemporal generalization ability.This improvement is primarily due to the incorporation of DL, which compensates for the shortcomings of physically based models (Feng et al., 2022;Shen et al., 2023).These shortcomings include gaps in process understanding, such as the representation of frozen soils (H.Gao et al., 2021), leading to uncertainty in model structure, and issues with equifinality during parameterization (Beven, 2006).Specifically, in this study, we utilized the dPL framework to estimate the parameters governing the runoff generation and channel routing calculations, with the design of dynamic parameters α and β explicitly addressing the uncertainties related to freeze-thaw processes and ET calculations.The application of DL techniques not only enhances computational efficiency but also effectively improves the performance of the model in simulating streamflow at both temporal and spatial test stations, as well as in reproducing the spatial pattern of remote sensing ET, demonstrating remarkable potential in the context of the PUB task.

Limitations and Future Work
Despite the significant potential exhibited by the distributed physics-informed DL model for the PUB task, we would like to emphasize certain limitations inherent in the existing methods.Due to the limited availability of discharge gauge stations in data-scarce regions, this study only considered models trained with a maximum of three stations, thus not fully exploring the impact of the number of training stations on model performance.The constraints imposed by the limited number of stations may similarly affect the performance of transfer learning models, and it is speculated that, with the utilization of more stations for retraining, the relative advantage of our model over the transfer learning model may diminish.This awaits further exploration in future research endeavors.Our model is more suitable for regions with particularly scarce data, where the basins are poorly represented in global watersheds.
Additionally, the small glaciers and lakes were not considered in this study, contributing to some level of uncertainty in the model's performance in related sub-basins.As Kraft et al. (2022) pointed out, the absence or misunderstanding of processes may lead to distortions in certain latent variables as the model attempts to compensate for these processes.The original intention behind the design of the distributed physics-informed model is to achieve a more physically based quantification of hydrological processes while effectively leveraging DL to compensate for process understanding.However, the trade-off between embracing more complex structures and the risks of parameter equifinality should be treated with caution (Feng et al., 2022), especially in the absence of better physical understanding and additional data constraints (Kraft et al., 2022).In our work, the distribution of meteorological stations in the upper sub-basins above the JIM station is sparse, resulting in limited accuracy of interpolated meteorological data (especially precipitation) (Figure S9 in Supporting Information S1) and basin attributes (Figure S10 in Supporting Information S1).The uncertainty in input data may also lead to the offsetting between different processes, partially explaining the anomalies in certain subbasins.More details can be found in Text S5 in Supporting Information S1.

Conclusions
In this study, we addressed the challenge of hydrological simulation in ungauged catchments by developing a physics-informed DL model based on a distributed framework.This model utilizes the differentiable EXP-HYDRO framework to calculate runoff generation in discrete sub-basins and employs the differentiable Muskingum routing method to simulate channel routing.The main conclusions are as follows: 1.The distributed physics-informed DL model exhibits a low training data requirement, while strong spatiotemporal generalization ability.The model trained using the most downstream station can achieve satisfactory simulations of streamflow at upstream stations and ET across ungauged sub-basins.Besides, the addition of more upstream stations can effectively enhance the simulations at these locations with minor performance degradation at downstream stations.2. The CAMELS-pretrained transfer learning model exhibits suboptimal performance in simulating streamflow at spatially held-out stations when retrained using only three local stations.This reflects the limitation of transfer learning in areas with a limited number of stations and poor hydrological representativeness.
3. Compared to transfer learning and distributed hydrological models, the distributed physics-informed DL model demonstrates stronger spatiotemporal generalization capabilities in data-scarce basins with poor hydrological representativeness.This provides a new approach to addressing the PUB task.

Figure 1 .
Figure 1.Relative location of the study area and the distribution of meteorological stations, hydrological stations, frozen ground, lakes, as well as glaciers within the catchment.The frozen ground data was adapted from Cao et al. (2023).

Figure 2 .
Figure 2. Diagram of the distributed physics-informed DL hydrological model.(a) Runoff generation module, (b) river routing module.The sub-basins were coded in the routing order.The lowest sub-basin of the mainstream was coded as 1.Along the upper mainstream, the number of sub-basins coded was added by 1 (e.g., 2, 3, 4…).The lowest subbasin of the tributary was coded x.1, and up the tributary, the code was added 0.1 (e.g., x.2, x.3…).For example, the upper reach of the primary sub-basin 1 is the primary sub-basin 2, which can be subdivided into two secondary sub-basins 2.1 and 2.2 along the tributary.When calculating the flow routing, the secondary subbasins were considered first and then the primary sub-basins.For example, the runoff from sub-basin 2.2 first flows along the tributary to the outlet of sub-basin 2.1, then superimposes the runoff generated in sub-basin 2.1 and the flow routed from sub-basin 3.1, and finally flows along the mainstream to the outlet of sub-basin 1.

Figure 3 .
Figure 3. Models' NSE in simulating daily streamflow at the five discharge gauge stations during the test period."T" denotes the temporal test station for the model and "S" denotes the spatial test station.

Figure 4 .
Figure 4. Models' RMSE in simulating daily streamflow at the five discharge gauge stations during the test period."T" denotes the temporal test station for the model and "S" denotes the spatial test station.

Figure 5 .
Figure 5. Annual mean hydrograph simulated by three distributed dPL models during the test period.

Figure 6 .
Figure 6.The average spatial runoff (i.e., the sum of surface flow and subsurface flow before river routing, measured in mm/d) during the testing period, simulated by four dis-TL-3 variants, with partially frozen or entirely unfrozen parameters, and the dis-dPL-3 model.(a) dis-TL-3a, with the parameters in the LSTM and Conv1D layers frozen, (b) dis-TL-3b, with the parameters in the LSTM layers frozen, (c) dis-TL-3c, with the parameters in the Conv1D layers frozen, (d) dis-TL-3d, with no parameters frozen, and (e) dis-dPL-3.

Figure 7 .
Figure 7.Comparison of the eight models' performance in simulating monthly ET during the test period.

Table 1
The Models Established in the Three Experiments and Their Abbreviations ZHONG ET AL.
85, with a median RMSE of around 0.76 mm/d.Given that GLDAS NOAH ET tends to overestimate ET ZHONG ET AL.