Extending global river gauge records using satellite observations

Long-term, continuous, and real-time streamflow records are essential for understanding and managing freshwater resources. However, we find that 37% of publicly available global gauge records (N = 45 837) are discontinuous and 77% of gauge records do not contain real-time data. Historical periods of social upheaval are associated with declines in gauge data availability. Using river width observations from Landsat and Sentinel-2 satellites, we fill in missing records at 2168 gauge locations worldwide with more than 275 000 daily discharge estimates. This task is accomplished with a river width-based rating curve technique that optimizes measurement location and rating function (median relative bias = 1.4%, median Kling-Gupta efficiency = 0.46). The rating curves presented here can be used to generate near real-time discharge measurements as new satellite images are acquired, improving our capabilities for monitoring and managing river resources.


Introduction
Since the turn of the 20th century, river gauges have provided critical river discharge measurements across much of the world (Murphy 1904). Long-term gauge records provide a baseline for understanding changes in water resources caused by climate change (Milly and Dunne 2020) and land use change (Gerten et al 2008), including human modifications of river corridors (Remo et al 2012). River gauges also produce the raw information needed to inform water resource management and thus often form the basis of water usage treaties (Drieschova et al 2008, Gerlak et al 2011, Dawadi and Ahmad 2012). Yet most publicly available global gauge records are not immediately available, do not extend to the present, and are often discontinuous. High-latency and incomplete gauge records have real-world consequences. When near-real-time (NRT) gauge data are available, they are useful for early flood warning systems and other NRT water resource management applications (Gourley et al 2013, Bates et al 2021). Inaccessible NRT gauge data hamper understanding of rivers at the global scale and also the development of gauge-constrained hydrological products at an application-relevant latency (National Water Model 2016). Further, many global gauge records contain missing measurements, often due to instrument error or government restrictions on data access (Hannah et al 2011). For instance, 69% of the 30 959 gauge records in the global streamflow indices and metadata (GSIM) archive contain missing measurements. Incomplete gauge records also exacerbate the placement bias of gauge networks (Krabbenhoft et al 2022). This spatial and temporal paucity of gauge data has motivated alternative approaches for observing discharge at large scales.
Satellite remote sensing of discharge can be used to address the shortcomings of gauge records (Smith 1997). To date, more than a dozen studies have focused on building rating curves between satellite river observations and gauge measurements (Gleason and Durand 2020). Rating curves represent an empirical relationship between an observable river parameter (e.g. width, water surface elevation) and river discharge. These rating curves can then be used to estimate discharge from the observable river parameter alone. Using optical imagery from satellite missions and sensors including Landsat, Sentinel-2, and the Moderate Resolution Imaging Spectroradiometer, several studies have paired satellite widths with gauge records to develop width-based rating curves (Smith and Pavelsky 2008, Pavelsky 2014, Feng et al 2019). Radar altimetry data have also been used to develop elevation-based rating curves in gauged reaches (Kouraev et al 2004, Getirana and Peters-Lidard 2013, Tourian et al 2013, Paris et al 2016). Unlike optical sensors, radar altimeters can collect data at night and during cloudy conditions, but they currently have a relatively coarse spatial resolution, limiting river elevation measurements to very wide rivers (Birkett et al 2002, Birkinshaw et al 2010, Coss et al 2020, Nielsen et al 2022). Rating curves can also be developed by pairing surface reflectance with gauge records (Brakenridge et al 2012, Hou et al 2018, Tarpanelli et al 2013, Tarpanelli and Domeneghetti 2021), but use of surface reflectance alone prevents integration of observations from other sensors into the rating curves. Once developed, rating curves can provide valuable discharge information for water management purposes. For example, rating curves are capable of providing NRT discharge estimates where gauge data are unavailable (Riggs et al 2022).
In addition, rating curves can supplement missing historic gauge measurements (Tourian et al 2017), improving our understanding of hydrological trends. Though numerous rating curve studies exist, they are typically limited to individual reaches or regions, making it difficult to assess their accuracy and usability at a broader scale (Huang et al 2018). Further, comparing rating curve approaches is difficult because many studies rely on different performance metrics. In this study, we compile the largest known dataset of publicly available river gauge data and analyze the temporal gaps of the compiled dataset. We then develop satellite-based rating curves, assess their accuracy, and apply them to fill in the temporal gaps of gauge records.

Compiling and analyzing global gauge data
Here we collect and assess daily gauge data from a combination of international and national organizations to build an extensive global gauge database (table 1). We remove gauges located within ∼100 m of each other, assuming that they are redundant with one another (after Crochemore et al (2020), see table S1 for further information). All gauge databases used in this study are publicly available through a variety of web interfaces except for the Chinese Hydrology Project gauge data, which comprise less than 1% of gauges in this study. We assess the compiled gauge database to better understand spatial and temporal trends in gauge availability at the continental and global scales from 1900-2021. Specifically, we assess the number of operating gauges (with ⩾335 daily measurements in a given year) (U.S. Geological Survey 2019) and the proportion of operating gauges (the ratio of N operating gauges to N gauges ever installed) per year. We define gauge availability as the number of operating gauges per year. We also investigate trends in missing gauge measurements by developing and applying a metric termed gauge record completeness (GRC), which is defined as
$$\mathrm{GRC} = \frac{1}{N_g} \sum_{i=1}^{N_g} \frac{N_{v,i}}{N_{p,i}} \qquad (1)$$
where $N_g$ represents the number of gauges, $N_{v,i}$ represents the number of valid measurements, and $N_{p,i}$ represents the number of days between the earliest and most recent valid discharge measurement for gauge $i$. We compare our findings with the 8187 gauge records containing daily discharge measurements in the Global Runoff Data Centre (GRDC) database (The Global Runoff Data Centre 2022). The GRDC is used for comparison with our compiled gauge database as it represents the largest collection of publicly accessible global gauge records.
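The two gauge-level quantities described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's code: GRC is taken to be the mean over gauges of valid measurements divided by the span between the first and last valid day, the span is counted inclusively, and the function names and the two example records are hypothetical.

```python
from datetime import date, timedelta


def record_completeness(valid_days):
    """N_v / N_p for one gauge: valid days over the span from first to last valid day."""
    days = sorted(set(valid_days))
    n_v = len(days)
    # Assumption: the span is counted inclusively (first and last day both count).
    n_p = (days[-1] - days[0]).days + 1
    return n_v / n_p


def gauge_record_completeness(records):
    """Mean per-gauge completeness over all gauges that have at least one valid day."""
    ratios = [record_completeness(days) for days in records.values() if days]
    return sum(ratios) / len(ratios)


def is_operating(valid_days, year, min_days=335):
    """True if the gauge has at least `min_days` valid daily values in `year`."""
    return sum(1 for d in set(valid_days) if d.year == year) >= min_days


# Hypothetical records: gauge "A" is complete over 10 days, gauge "B" misses 5 of 10.
start = date(2020, 1, 1)
records = {
    "A": [start + timedelta(days=k) for k in range(10)],
    "B": [start + timedelta(days=k) for k in (0, 1, 2, 3, 9)],
}
grc = gauge_record_completeness(records)
print(f"GRC = {grc:.0%}")  # GRC = 75%
```

Averaging per-gauge ratios gives every gauge equal weight regardless of record length; pooling all days before dividing would instead weight long records more heavily.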

Remote sensing of river widths
Because our technique combines same-day gauge discharge and satellite width measurements, we only use gauge records containing discharge measurements after the launch of Landsat 5 (1 March 1984), the earliest satellite used in this study. As gauge stations are often located on narrow, stable reaches (Park 1977), we consider locations upstream and downstream of gauges to determine the most suitable site for developing a width-based rating curve. These locations (hereinafter referred to as nodes) and the corresponding widths are from the Surface Water and Ocean Topography (SWOT) River Database (SWORD) (Altenau et al 2021) and represent locations where SWOT is expected to provide river observations at ∼200 m spacing. The SWORD mean width attribute is derived from river width measurements at mean discharge (Allen and Pavelsky 2018), which correspond on average to mean width (Allen et al 2020). We limit the analysis to gauges with at least one SWORD node mean width ⩾120 m (after Ishitsuka et al 2021) within a 2 km Euclidean radius and a 10 km river network distance (figure 1(a)). As some gauges lie on nearby tributaries, we further limit the analysis to gauges with a mean discharge ⩾20 cubic meters per second (cms) because, on average, this discharge corresponds to the minimum discharge for a river with a mean width ⩾120 m (Frasson et al 2019) (table S1). To calculate river widths, we use Google Earth Engine (Gorelick et al 2017) and a modified version of RivWidthCloud to process Landsat-5, 7, 8 (30 m spatial resolution) and Sentinel-2 (10 m spatial resolution) imagery from 1 March 1984 to 27 August 2021 (see text S1 for more information).

Rating curve functions
We pair same-day gauge discharge with satellite width measurements at each node to develop four types of rating curve functions (figure 1(b)), further increasing the number of candidate rating curves for each gauge. The four rating curve types are at-a-station hydraulic geometry (AHG) (i.e. power-law function) (Leopold and Maddock 1953), piecewise linear regression (Lewis 1966, Mersel et al 2013, Elmi et al 2021), monotonic spline (Clarke et al 2000), and random forest (Kumar et al 2020). We select these four diverse function types because they are common approaches with their own respective strengths and weaknesses (see code repository for function parameterization). To eliminate unrealistic rating curves, we remove individual rating curves that do not exhibit an overall positive relationship between width and discharge. In total, 145 232 rating curves are developed and 52 178 rating curves are flagged for errors or insufficient training data and removed (see table S1 for more information). We use the oldest 70% of the paired width-discharge data to calibrate the rating curves and use the remaining paired data for validation. We assess the performance of all node-level rating curves using the Kling-Gupta efficiency (KGE) (see table S2 for error metric equations) (Gupta et al 2009) and designate the highest performing node-level rating curve as the rating curve for that gauge (figure 1(c)). For example, if there are 20 viable SWORD nodes for a single gauge, 80 rating curves are generated (20 SWORD nodes × 4 rating curve types) and only the highest performing rating curve would be used for supplementing the gauge record. We apply the commonly used KGE for the rating curve assessment because its performance integrates both model variability and bias in a more balanced manner than similar metrics such as the Nash-Sutcliffe efficiency (Nash and Sutcliffe 1970, Gupta et al 2009).
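The calibration, validation, and node-selection steps described above can be sketched as follows. This is a simplified illustration rather than the study's implementation (which is in the code repository): it covers only the AHG power-law type, assumes the form w = aQ^b fitted by least squares in log-log space, and uses the standard KGE definition of Gupta et al (2009). All function names and the synthetic node data are hypothetical.

```python
import math


def fit_ahg(discharge, width):
    """Fit w = a * Q**b by ordinary least squares in log-log space; return (a, b)."""
    x = [math.log(q) for q in discharge]
    y = [math.log(w) for w in width]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b


def width_to_discharge(width, a, b):
    """Invert w = a * Q**b to estimate discharge from an observed width."""
    return (width / a) ** (1.0 / b)


def kge(sim, obs):
    """Kling-Gupta efficiency in its standard form (Gupta et al 2009)."""
    n = len(obs)
    ms, mo = sum(sim) / n, sum(obs) / n
    ss = math.sqrt(sum((s - ms) ** 2 for s in sim) / n)
    so = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    r = sum((s - ms) * (o - mo) for s, o in zip(sim, obs)) / (n * ss * so)
    return 1.0 - math.sqrt((r - 1) ** 2 + (ss / so - 1) ** 2 + (ms / mo - 1) ** 2)


def calibrate_and_validate(pairs, train_frac=0.7):
    """pairs: (day, discharge, width) tuples. Oldest 70% calibrate, the rest validate."""
    pairs = sorted(pairs)
    k = round(len(pairs) * train_frac)
    train, test = pairs[:k], pairs[k:]
    a, b = fit_ahg([p[1] for p in train], [p[2] for p in train])
    if b <= 0:  # discard curves without a positive width-discharge relationship
        return None
    sim = [width_to_discharge(p[2], a, b) for p in test]
    return a, b, kge(sim, [p[1] for p in test])


def best_node(node_pairs):
    """Return (node_id, (a, b, kge)) for the node with the highest validation KGE."""
    results = {}
    for node_id, pairs in node_pairs.items():
        fit = calibrate_and_validate(pairs)
        if fit is not None:
            results[node_id] = fit
    return max(results.items(), key=lambda kv: kv[1][2])


# Synthetic example: one node follows w = 33 * Q**0.42 exactly, the other is noisy.
flows = [20, 35, 50, 80, 120, 200, 300, 450, 700, 1000]
clean = [(day, q, 33 * q ** 0.42) for day, q in enumerate(flows)]
noisy = [(day, q, w * (1.2 if day % 2 == 0 else 0.8)) for day, q, w in clean]
best_id, (a, b, score) = best_node({"clean": clean, "noisy": noisy})
print(best_id, round(a, 2), round(b, 2), round(score, 3))
```

Splitting by time rather than at random means the validation data are always more recent than the calibration data, which mirrors how a curve would be used to extend a record forward.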
To compare the performance of our remote sensing approach to the performance of a global hydrologic model, we compare our rating curves to the Global Reach-level A priori Discharge Estimates for SWOT (GRADES) hydrological model which contains daily discharge estimates at 2.94 million river reaches from 1979-2014 (Lin et al 2019). We select GRADES for comparison because it will provide a priori discharge estimates for the SWOT discharge product and we seek to determine whether this study's rating curves could be used to improve upon these a priori discharge estimates.

Gaps in the global gauge record
Analyzing a total of 45 837 global river gauge records, we characterize the spatial and temporal trends of global river gauge availability (figures 2 and S1). We find that global gauge availability (N operating gauges) increased from 1900 until the 1980s and has since remained relatively stable (figure 2(b)). However, this global trend in gauge availability is largely dictated by the North American data because this region contains 67% of the gauges in our database followed by Oceania (10%), Europe (9%), Asia (6%), South America (5%), and Africa (3%). At the continental scale, we find that North American gauge availability has remained relatively steady since ∼1980 whereas Asia, Oceania, and South America steadily increase in gauge availability over time (figure 3). Conversely, from 1980 to 2015, African and European gauge availability declined by 47% and 20%, respectively.
Similarly, we find that the proportion of operating gauges (the ratio of N operating gauges to N gauges ever installed) tends to increase over time on most continents with the exception of Africa and Europe, which have seen a declining proportion since the 1980s (figure 3). We also investigate instances of missing gauge measurements to understand the causes of gauge record fragmentation. Globally, the median duration a gauge is offline is five consecutive days, which could lead to missed observations of important flow events such as floods. Gauges go offline a median of once throughout their lifespan, with the median number of offline events per gauge higher in Africa (10) and South America (8) than in Oceania (3), Asia (3), Europe (1) or North America (1). Across all gauges, the global GRC (equation 1) is 86%. At the continental scale, we find that GRC is lower in Africa (87%), Asia (87%), and North America (84%) than in Europe (92%), Oceania (91%), and South America (91%) (figure 3). GRC tends to increase over time, with periodic declines that often occur during socially turbulent time periods (figure 3). We emphasize that these findings represent correlation and not necessarily causation between historic events and GRC. Our gauge analysis is performed at the continental to global scale so we cannot justify making inferences related to individual countries' political and social history. Thus, the following is not an exhaustive list but rather a broad continental-scale overview of coinciding chaotic historical events and declines in GRC. We base our historical analysis on the work of Findley and Rothney (2011), which provides detailed information on the history of the 20th century. African GRC decreases from 1939-1945 (two-sample, one-sided Welch's t-test p < 0.001) and from 1988-1991 (p < 0.05), which coincide with World War II and the dissolution of the Soviet Union, respectively.
Asian GRC improves for much of the time series with periodic decreases from 1939-1945 (p < 0.05) and the 1970s, which align with World War II and Southeast Asian wars, respectively. European GRC shows three distinct declines: 1914-1918 (p < 0.001), 1929-1945 (p < 0.001), and 1988-1991 (p < 0.001). These declines in European GRC occur simultaneously with World War I, the Great Depression coupled with the Spanish Civil War and World War II, and the Soviet Union's breakup. North American, Oceanian, and South American GRC broadly increase over time with a decline in North and South American GRC from 1929-1939 (p < 0.001), which is in sync with the Great Depression. Across nearly all continents, we find that most large dips in GRC correlate with tumultuous historical events.

Supplementing global gauge records with satellite observations
To fill in gaps in the global gauge record with satellite measurements, we identify 2168 gauges that meet the requirements discussed in section 2.2. We use an average of 111 paired width-discharge data to calibrate the rating curves (N = 240 735) with an average of 48 paired data for rating curve validation (N = 103 246). The optimal node-level rating curve is often located some distance from the gauge location (mean streamwise distance of 1281 m) and we find that upstream and downstream nodes perform equally well. Among the optimal rating curves fit at each gauge, our technique selects AHG the most frequently (60%), followed by monotonic spline (20%), piecewise linear regression (15%), and random forest (5%). AHG power-law rating curves have a mean a coefficient of 33 ± 53 (1-sigma variation) and a mean b exponent of 0.42 ± 0.22. We use RiverATLAS (Linke et al 2019) to assign Strahler stream order to each gauge and find that the proportion of AHG rating curves declines in stream orders greater than 6 (figure S2).
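As a rough worked example of how an AHG curve converts a width into a discharge, assume the standard power-law form $w = aQ^b$ (width $w$ in m, discharge $Q$ in cms; the exact parameterization used in the study is in the code repository and is not reproduced here). The mean coefficients reported above are used purely for illustration and do not describe any individual gauge:

```latex
w = a Q^{b}
\quad\Longrightarrow\quad
Q = \left(\frac{w}{a}\right)^{1/b},
\qquad
Q\big|_{w = 120\ \mathrm{m}} = \left(\frac{120}{33}\right)^{1/0.42} \approx 22\ \mathrm{cms}
```

Under these assumptions a 120 m wide river corresponds to a discharge of the same order as the 20 cms threshold used in the gauge selection. Because the 1-sigma spread on both coefficients is large, the estimate for any single site could differ substantially.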
Applying our width-based rating curve approach to each of the 2168 gauges adds 279 937 daily discharge observations to global gauge records, where 246 152 of these discharge observations extend gauge records beyond their lifespan and 33 785 discharge observations fill historical gaps in gauge records (figure 4). Using the BasinATLAS Level 3 product (Linke et al 2019), we find that gauge records in arid environments (average annual precipitation/potential evapotranspiration <0.5) are filled in at a higher mean annual rate than in more humid regions (5.0 vs. 3.3 observations/gauge/year, respectively), likely due to a higher proportion of cloud-free imagery in arid environments (Ju and Roy 2008). The infilling rate is also impacted by the number of available satellite-based sensors during a given time period. From 2017 to 2021, the heightened availability of Sentinel-2 data combined with the dearth of more recent gauge data (figure 4(b)) quadruples the infilling rate to 26 094 observations/year. The decommissioning of satellites has the opposite effect, with the notable example of a major decrease in observations/year in 2012 due to instrument failures on Landsat 5 prior to the commissioning of Landsat 8 (Neigh 2021).
The rating curve approach presented here produces a median rBias of 1.4% (mean = 4.6%) and a median absolute rBias of 14% (mean = 21.5%), substantially lower than the typical bias produced from hydrological models (figure 4(c)). In addition, we find a median normalized root-mean-square error (NRMSE), relative root-mean-square error (RRMSE), and KGE of 63%, 83%, and 0.46, respectively. Ninety-seven percent of the rating curves have a KGE >-0.41, implying that the vast majority of locations improve upon using the mean observed discharge for supplementing the gauge record (Knoben et al 2019). We find that rating curve performance remains relatively stable regardless of stream order (figure S3). We compare GRADES daily discharge to 2071 of the gauge records used in our analysis as these records overlap with the GRADES simulation data. As expected, we find that GRADES can supplement the global gauge record at a higher rate than our rating curve approach (figure 4(b)). However, our rating curve method outperforms GRADES across all gauges in terms of rBias, NRMSE and KGE, and particularly outperforms GRADES in arid regions (figures 4(c)-(e)). GRADES' poor performance in arid regions has been attributed to inaccurate meteorological inputs (Beck et al 2015) and could also be related to the non-perennial and flashy nature of rivers in arid regions. In contrast, our rating curves are gauge-constrained and based on direct satellite observations, allowing them to estimate discharge without the need to model hydrological processes.
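The -0.41 threshold follows directly from the KGE definition. The equations in table S2 are not reproduced in this text, so the form below assumes the standard definition from Gupta et al (2009), with $r$ the linear correlation between simulated and observed discharge, $\alpha$ the ratio of their standard deviations, and $\beta$ the ratio of their means:

```latex
\mathrm{KGE} = 1 - \sqrt{(r - 1)^{2} + (\alpha - 1)^{2} + (\beta - 1)^{2}},
\qquad
\alpha = \frac{\sigma_{\mathrm{sim}}}{\sigma_{\mathrm{obs}}},
\quad
\beta = \frac{\mu_{\mathrm{sim}}}{\mu_{\mathrm{obs}}}

% Benchmark: predicting the mean observed discharge at every time step
% gives \beta = 1 and \alpha = 0, with r taken as 0 (Knoben et al 2019):
\mathrm{KGE}_{\mathrm{mean}} = 1 - \sqrt{(0 - 1)^{2} + (0 - 1)^{2} + (1 - 1)^{2}} = 1 - \sqrt{2} \approx -0.41
```

A rating curve with KGE above this value therefore carries more information than simply substituting the mean observed flow, even when its KGE is negative.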

Discussion
In contrast to previous gauge availability studies that rely entirely on the GRDC (Hannah et al 2011) or use some but not all of the national organizations found in the GSIM database (Crochemore et al 2020), we do not find a decline in record availability (the number of operating gauges per year) over the last four decades (figure 2). These differences in findings are driven by newfound increases in gauge availability across much of Asia, Oceania and South America, as well as a less rapid decline in Europe since the 1980s. In Asia, an overall increase in continental gauge availability is contrasted by the absence of gauge data in China and Russia after 2004 and 2011, respectively. Further, the Asian gauge network is unevenly spatially distributed with only ∼13% of Asian gauges located in China and Russia even though these countries cover approximately half of Asia's landmass. International gauge databases such as the GRDC have proved instrumental in providing access to global gauge data for hydrological studies (Addor et al 2020). However, pulling data directly from publicly available national agency websites is a promising way to improve data latency and coverage as international gauge databases are updated infrequently and are insufficient to characterize the global availability of gauge records (figure 2(b)).
The rating curves developed here can be used to improve river flow measurement latency in large rivers with near-zero rBias on average. We find that our AHG b exponent values are within the range of other gauge-based AHG studies (Dingman 2007, Gleason 2015), which is likely related to gauges being preferentially placed along stable reaches (Allen and Pavelsky 2015). We speculate that human modifications along large rivers or fundamental differences in geomorphic scaling relationships (e.g. width/depth ratio) may be the reason for the decline in AHG performance in higher stream order rivers but future studies should explore this relationship further. As 77% of the gauge records collected here do not contain data within a month of access, low-latency satellite data (Landsat: 0-26 d, Sentinel-2: 0-7 d) can be used to improve the amount of NRT information on river flows (U.S. Geological Survey 2018, ESA 2022). For example, 30% of our satellite-based gauge record additions are from 2020 or 2021, which is driven by the lack of NRT gauge data. While hydrological models can also be used to supplement gauge records, our rating curves outperform the GRADES hydrological model and produce a lower bias, on average, than several additional large-scale hydrological models (Zhao et al 2017, Harrigan et al 2020). Although our rating curve approach has near-zero rBias on average and provides NRT data, it is unable to match the continuous temporal resolution and contiguous spatial coverage of hydrological models (figure 4(b)). Therefore, perhaps the best solution could be obtained by incorporating these rating-curve-based estimates into a hydrologic model through data assimilation.
The raw rating curves produced by this study can also be used in conjunction with additional satellite missions to further improve access to NRT discharge estimates. For instance, the SWOT satellite will provide global discharge estimates by combining SWOT river observations with gauge and theoretical information (Durand et al 2014, Larnier et al 2021). The official SWOT discharge products require prior discharge estimates from historic gauge data or modeled discharge and performance is limited in places by high bias in the prior discharge estimates (Durand et al 2016, deFrasson et al 2021). Thus, our near-zero rBias rating curve approach is potentially advantageous for providing this critical prior information alongside modeled data from GRADES in these locations.
Although the rating curves presented here can provide valuable NRT information, there are limitations involved in the approach presented in this study. First, our rating curve technique is limited by the spatial resolution of the imagery used, restricting rating curve development to only ∼5% of publicly available gauges. The satellite optical sensors used in this study are inherently limited by cloud cover and nighttime conditions (King et al 2013). Thus, the implementation of SWOT or other radar observations into our rating curves will provide additional data for supplementing global gauge records at a greater temporal resolution. With the expected proliferation of satellite observations in the coming years (Rosen et al 2017, Blumstein et al 2019, Wulder et al 2019), future efforts could integrate river observations from a variety of available sensors and sensor types (optical, microwave, altimeters) to improve the spatial and temporal capabilities for supplementing global gauge records. Additional observations could be extracted from commercial satellite data (Stringham et al 2019, Ignatenko et al 2020, Kulu 2021). Ultimately, a multisensor approach for satellite remote sensing could serve to improve our understanding of global river discharge dynamics.

Conclusions
We find that satellite remote sensing can supplement global river gauge records by filling in temporal gaps and extending high-latency databases to NRT. In contrast to previous studies that largely rely on international gauge databases, our analysis of both national and international gauge databases indicates that global gauge availability is not declining but has been relatively stable since ∼1980. Specifically, we find that while gauge availability is decreasing on some continents (Africa and Europe), many continents are showing increases in the number of publicly available gauges (Asia, Oceania, and South America) or are remaining steady (North America) (figure 2). We find that gauges typically go offline for short periods (∼5 consecutive days) and a much larger percentage of missing measurements occurs during historical periods of social upheaval (figure 3). The rating curve technique presented here provides a near-zero bias and high KGE approach for providing historic and NRT discharge estimates, which can be used to fill in and extend gauge records (figure 4). Further, our rating curves can provide NRT access to discharge data, unavailable at 77% of gauges. Finally, the approach developed here can be used in conjunction with future satellite missions to significantly improve the spatial and temporal density of remotely sensed discharge data to further supplement global gauge records.

Acknowledgments
This work was supported by the NASA SWOT Science Team (NNH19ZDA001N-SWOTST), NASA's Terrestrial Hydrology Program (NNH17ZDA001N-THP), the Texas A&M Presidential Excellence Fund, and the Texas Space Grant Consortium. T M Pavelsky's work on this project was supported by a contract from the SWOT Project Office at the NASA/Caltech Jet Propulsion Lab. C H David was supported by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with NASA.