A synthesis of hydroclimatic, ecological, and socioeconomic data for transdisciplinary research in the Mekong

The Mekong River basin (MRB) is a transboundary basin that supports the livelihoods of over 70 million inhabitants and diverse terrestrial-aquatic ecosystems. This critical lifeline for people and ecosystems is under transformation due to climatic stressors and human activities (e.g., land use change and dam construction). Thus, there is an urgent need to better understand the changing hydrological and ecological systems in the MRB and develop improved adaptation strategies. This, however, is hampered partly by the lack of sufficient, reliable, and accessible observational data across the basin. Here, we fill this long-standing gap for the MRB by synthesizing climate, hydrological, ecological, and socioeconomic data from various disparate sources. The data, including groundwater records digitized from the literature, provide crucial insights into surface water systems, groundwater dynamics, land use patterns, and socioeconomic changes. The analyses presented also shed light on uncertainties associated with various datasets and the most appropriate choices. These datasets are expected to advance socio-hydrological research and inform science-based management decisions and policymaking for sustainable food-energy-water, livelihood, and ecological systems in the MRB.


S. No. Data Source Native resolution Data type Remarks or weblinks*
www.nature.com/scientificdata
As such, precipitation products are seemingly many; however, their spatial resolution and temporal availability period limit their utility for many applications; for example, process-based hydrological modeling often requires sub-daily data (e.g., Kabir et al. 44 ), but many products noted above include only daily datasets. Table 2 summarizes, to our knowledge, the existing precipitation products, mostly global, with details on their resolution and availability period.
Many studies, especially on hydrological modeling, require meteorological inputs other than precipitation, including temperature, solar radiation, humidity, surface pressure, and wind speed. Such data are largely lacking for the MRB, except for the sparse gauge-based data from the MRC (Fig. 1). Therefore, modeling studies generally employ data from global products, which are primarily based on atmospheric reanalysis such as the ECMWF Reanalysis v5 (ERA5; Hersbach et al. 99 ). There are numerous other global products that could be used for basin-scale modeling, which are derived from different reanalysis datasets. These include the Princeton Global Forcing data 100,101 , WATCH Forcing methodology applied to ERA-Interim reanalysis data (WFDEI 102,103 ), meteorological forcing data of the third Global Soil Wetness Project (GSWP3; Kim 104 ), and WFDE5 over land merged with ERA5 over the ocean (W5E5; Lange et al. 105 ). Brocca et al. 106 proposed an algorithm to estimate the effective rainfall data from in-situ soil moisture data (SM2RAIN). This algorithm was later applied to various satellite-based soil moisture datasets to estimate global effective rainfall (e.g., SM2RAIN-CCI (Ciabatta et al. 107 ), SM2RAIN-ASCAT (Brocca et al. 108 ), and GMP + SM2RAIN (Massari et al. 109 )). One common limitation in many of these products is the coarse spatial resolution (typically 0.5°, ~50 km at the equator), which limits the application to only basin-scale modeling studies 19,44 .
To overcome the limitations related to spatial-temporal resolution and inherent biases, recent efforts have led to the development of higher resolution products such as the Ensemble Meteorological Dataset for Planet Earth (EM-Earth) data at 0.1° (~10 km at the equator) spatial resolution over global land areas from 1950 to 2019 110 . These data have 25 ensemble members enabling uncertainty analyses and sensitivity test in hydrological modeling. Another such recent product is the Climatologies at high resolution for the earth's land surface areas (CHELSA) data (Karger et al. 111 ; https://chelsa-climate.org/), also available at 30 arc seconds (~1 km) globally. However, both of these products are available at a daily time step, limiting the utility to models that only resolve water balance; land surface models that resolve energy balance typically require sub-daily datasets 112 . Nevertheless, the EM-Earth ensemble datasets have the potential to be useful for probabilistic climate and hydrological modeling. We have synthesized these datasets or have noted relevant sources where data are readily accessible.
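The approximate ground distances quoted for the grid resolutions above can be checked with a short calculation. The following is a minimal sketch, assuming a spherical arc along the equator with the WGS84 equatorial radius; the function name and the dictionary of labels are illustrative and not part of any dataset's tooling.

```python
import math

# WGS84 equatorial radius in km (assumed here for a simple spherical arc length).
EARTH_EQUATORIAL_RADIUS_KM = 6378.137


def degrees_to_km_at_equator(deg):
    """Length in km of an arc spanning `deg` degrees of longitude along the equator."""
    return math.radians(deg) * EARTH_EQUATORIAL_RADIUS_KM


# Grid spacings mentioned in the text, expressed in decimal degrees.
resolutions_deg = {
    "0.5 degree": 0.5,
    "0.1 degree (EM-Earth)": 0.1,
    "30 arc seconds (CHELSA)": 30 / 3600,
}

for label, deg in resolutions_deg.items():
    print(f"{label}: {degrees_to_km_at_equator(deg):.2f} km")
```

Under these assumptions the three spacings come out at roughly 56 km, 11 km, and 0.9 km, consistent with the rounded values in the text. East-west cell width shrinks with the cosine of latitude, so cells over the upper basin are narrower than these equatorial figures.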
Hydrological data. Streamflow and water level. The primary source of observed hydrological data in the MRB is the Mekong River Commission (MRC), which provides gauge-based river discharge and water level data at over 29 and 47 stations, respectively, across the basin (available through formal agreement or for purchase). Water level observations are also available from other sources such as the Cambodia Ministry of Water Resources and have been presented in the published literature (e.g., Arias et al. 113 ). Observed data for the Chinese portion of the basin (i.e., Upper MRB (UMRB)) are generally not available to the international community but have been presented in some journal articles 3,26,37,114,115 [...] Gajiu 116 , and Changdu 117 stations in China; we have digitized these from the published literature. The Global Runoff Data Center (GRDC) provides some streamflow data for the MRB within its global database but only for a small number of stations, which are included in the MRB database. Here, we present the complete information on available data from the MRB and other sources, along with some infographics.
Evapotranspiration (ET). Similarly, ET is typically not measured in-situ due to the difficulty of deploying a ground-based monitoring network over the entire MRB. Therefore, satellite-based ET products, which provide a continuous record of ET at a global scale with a relatively high temporal resolution, are often used as an alternative. Some global ET products that have been used in the MRB include the water balance (WB; Zeng et al. 118 ) based ET, GLEAM product (Martens et al. 119 ) [...] 124 have evaluated the performance of these products in the MRB and found that their accuracy can vary depending on specific conditions and characteristics of the region. Chen et al. 124 concluded that Moderate Resolution Imaging Spectroradiometer (MODIS) ET underperforms in the MRB compared to other selected datasets, while Hu and Mo 123 compared model-simulated ET with satellite datasets and suggested that, in the MRB, GLEAM ET performs comparatively better than other products. Further, MODIS ET does not include data for land cover types specified as unclassified, urban, wetlands, perennial snow/ice, and permanent water bodies. Here, given the uncertainties in both GLEAM (version 3.6b) and MODIS (version 6.1, gap filled) ET datasets and the lack of observational data, we compare the two to demonstrate how they differ spatially and temporally.
Surface water. Monitoring surface water volume is a crucial aspect of water resource management, as it helps understand the availability and dynamics of water in a region. Surface water can be monitored using surface water area and water level 125,126 . Satellite altimetry missions, such as TOPEX/Poseidon, Jason-1, Jason-2, Jason-3, and Envisat, use radar measurements to determine the height of the water surface, whereas satellite imagery, such as that provided by MODIS, LANDSAT, and Sentinel, can be used to measure surface water area. For example, the European Commission's Joint Research Centre (JRC) dataset, developed by Pekel et al. 127 , used LANDSAT data at a spatial resolution of 30 meters to monitor surface water extent from 1984 to 2015. However, the temporal resolution of these datasets is relatively coarse, and they are available only in the form of percentage water occurrence at the monthly scale or as yearly classification. Moreover, only a limited number of cloud-free images are available for the MRB 9 . The recently launched Surface Water and Ocean Topography (SWOT) mission is expected to enable us to address some of these limitations and greatly improve our ability to monitor surface water volume, especially by providing high-resolution data on surface water area and water level. Further, there are other satellite-based surface water products such as those generated by Ji et al. 128 using MODIS data, which are available at a daily interval, at a spatial resolution of 500 m, and for the 2001-2016 period (data source: http://data.starcloud.pcl.ac.cn/resource/9). Here, we have processed and compared the two remotely sensed surface water products by Pekel et al. 127 and Ji et al. 128 for the MRB, and present the processed surface water datasets to the community.
Soil moisture. In-situ soil moisture data for the MRB are scarce, if not non-existent, at the basin scale. As a result, the only choice is to use globally available remote sensing-based soil moisture products. For example, soil moisture data are available from the i) Soil Moisture Active Passive (SMAP; Entekhabi et al. 129 ) at 9 km spatial resolution, ii) Soil Moisture and Ocean Salinity Level 3 (SMOS L3; Jacquette et al. 130 ) at 25 km, iii) European Space Agency Climate Change Initiative (ESA-CCI SM v2.7; Liu et al. 131 ; Wagner et al. 132 ) at 25 km, and iv) Global Land Evaporation Amsterdam Model (GLEAM; Martens et al. 119 ) at 25 km. Recently, the SMAP soil moisture data have been downscaled to a finer spatial resolution of 1 km globally 133,134 as well as locally 135 . In this study, we focus on the downscaled 1 km SMAP product by Fang et al. 133 while also noting the utility of the other products. Among the scarce and disparate observed data are the observations at five locations (Chaiyabhumi, Srisaket, Amnatcharoen, Sakonnakhon, and Bungkan) in Thailand, available at 5-minute intervals from 14th December 2017 to 12th February 2019 and provided by an individual scholar (see Acknowledgment section).
Groundwater. Groundwater data in the MRB are collected by respective government agencies in each member country. For example, the National Centre for Water Resources Planning and Investigation (NAWAPI) in Vietnam, the Department of Groundwater Resources (DGR) in Thailand, the Ministry of Water Resources and Meteorology (MOWRAM) in Cambodia, and the Department of Water Resources under the Minister of Natural Resources and Environment (DWR-MONRE) in Laos conduct groundwater monitoring. However, these datasets are generally not available to the public, nor included within the MRC's database. Some of the datasets (e.g., from NAWAPI) are available for scientific research conducted with an in-country team but are restricted from broader sharing. Further, numerous previous studies have collected groundwater data on an individual basis or obtained from certain partner agencies in the region. Yet, the data have not been shared beyond certain graphics in journal articles. Here, we have digitized all published data, obtainable through our best efforts from published sources 46,[54][55][56] , and identified various other sources via which groundwater data can be obtained, for example, through formal agreements with respective agencies. Details are provided in Table S1.
Dam data. Recently, over 100 hydropower dams have been constructed across the MRB, dramatically increasing reservoir storage capacity from ~5 to ~70 km 3 during 2010-2020 25 . Therefore, dams and their operation have become crucial aspects of hydrologic and ecosystem studies in the MRB, which demand reliable data on the attributes of existing and planned dams as well as on the way reservoirs are operated. Globally, data on large dams are available through the database of the World Register of Dams (WRD), maintained by the International Commission on Large Dams (ICOLD). These data have been synthesized, for example producing the Global Reservoir and Dam (GRanD) data 157 and used in many global studies 112,[158][159][160] . However, these global data include only a few large dams in the MRB, leaving a major information gap regarding the smaller or recently built dams or those that are planned. The GlObal geOreferenced Database of Dams (GOODD; Mulligan et al. 161 ) includes a larger number of dams compared to GRanD and the georeferenced global dams and reservoirs (GeoDAR; Wang et al. 162 ) and provides richer information on global dams. Yet, the necessary dam attributes (e.g., dam height and reservoir storage capacity) are not comprehensively included in most of these datasets. Recently, Zhang and Gu 163 developed the Global Dam Tracker (GDAT), a comprehensive dam database which includes more than 35,000 global dams with their location, catchment area, and other attributes. The GDAT dataset includes attributes of 466 dams in the MRB. Further, there are notable discrepancies or missing attributes in many of these products (e.g., Shin et al. 45 ).
In this study, we present the data from the Research Program on Water, Land, and Ecosystems (WLE Mekong; https://wle-mekong.cgiar.org/) as the base product and enhance the database by using information from various other sources. Note that WLE is the primary data source for GDAT for the MRB region. Specifically, building on the efforts of Shin et al. 45 , we conducted a thorough inspection of the existing database, made manual corrections using various independent sources (e.g., Google Earth, internet resources on individual dams, published literature and reports), and further verified with credible sources (e.g., Yigzaw et al. 164 ; Yun et al. 165 ; Galelli et al. 75 ; Schmitt et al. 166 ). We also selected dams that have either or both the dam height and reservoir capacity as these are the two basic attributes for dam impact studies. Finally, we have selected large dams, satisfying one of the following criteria: (1) [...] Table 3. Among these land use datasets, we selected ESA-CCI (https://www.esa-landcover-cci.org/) land use data for demonstration in this study owing to its relatively longer temporal coverage and high spatial resolution.
Crop datasets are crucial for accurately modeling hydrological and agricultural processes but datasets on crop types and cropping patterns are not specifically available for the MRB. Thus, as for many other regions, the alternative is to use crop types from Remote sensing. The commonly used Leaf Area Index (LAI) data, an important modelling attribute, in many MRB studies are based on MODIS products (e.g., Son et al. 65 ; Hu and Mo 173 ). Another critical parameter for understanding food security and agricultural productivity is crop yield, which is not available at a basin-wide scale. Therefore, studies in the MRB use global annual data on crop yield such as Food and Agriculture Organization Corporate Statistical Database (FAOSTAT) 76,174,175 . For this study, we obtained country-based annual crop yield data for the period of 1961-2021 for the Lower MRB (LMRB) countries (Cambodia, Laos, Thailand, and Vietnam) from FAOSTAT. The datasets include annual crop yield for crops such as rice, maize, banana, and sugarcane, etc.
Similar to crop yield, crop calendar datasets are available at the global scale. Crop calendar datasets are necessary inputs in hydrological-agricultural modeling, and crucial products for broader agricultural and food security studies. The International Production Assessment Division (IPAD) of the U.S. Department of Agriculture (USDA) Foreign Agricultural Service (FAS) provides global crop calendar data for planting, mid-season, and harvesting periods for grains, oilseeds, and cotton. In addition to IPAD, the Group on Earth Observations Global Agricultural Monitoring (GEOGLAM; Whitcraft et al. 176 ) has developed crop calendar data using MODIS products for several countries at the national and sub-national scales. Furthermore, Jägermeyr et al. 177 created a gridded dataset of crop calendars for the Global Gridded Crop Model Intercomparison (GGCMI) at a 0.5° spatial resolution. The datasets were generated by combining information from nine observational sources at 0.5° land grid cells for 18 different crops, distinguishing between rainfed and irrigated systems. The dataset includes information on planting day, maturity day, growing season length, primary data source, and the fraction of harvested area. The GGCMI datasets are produced and validated using multiple sources and are gridded products that can be readily used for modeling purposes. We utilized the GGCMI for our study by extracting the MRB region from the global database.
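Extracting a regional window from a global 0.5° gridded product, as described above for the crop calendar data, can be sketched as a simple bounding-box selection. This is a minimal illustration and not the authors' procedure: the function name, the synthetic global field, and the rough bounding box (8-34°N, 93-109°E) are assumptions, and a real extraction would more likely mask cells with a basin boundary polygon.

```python
import numpy as np


def subset_grid(data, lats, lons, lat_min, lat_max, lon_min, lon_max):
    """Return the part of a (lat, lon) grid whose cell centres fall inside a bounding box."""
    lat_mask = (lats >= lat_min) & (lats <= lat_max)
    lon_mask = (lons >= lon_min) & (lons <= lon_max)
    # np.ix_ builds an open mesh so both masks are applied in one indexing step.
    return data[np.ix_(lat_mask, lon_mask)], lats[lat_mask], lons[lon_mask]


# Cell centres of a global 0.5 degree grid, ordered north to south and west to east.
lats = np.arange(89.75, -90, -0.5)
lons = np.arange(-179.75, 180, 0.5)

# Synthetic stand-in for one global variable (e.g., planting day); values are placeholders.
global_field = np.arange(lats.size * lons.size, dtype=float).reshape(lats.size, lons.size)

# Rough bounding box around the basin (an assumption for illustration only).
sub, sub_lats, sub_lons = subset_grid(global_field, lats, lons, 8.0, 34.0, 93.0, 109.0)
print(sub.shape)
```

A bounding box keeps cells outside the basin, so any basin-wide statistic computed from the window would still need a mask for the actual basin outline.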
Irrigated area and irrigation water use. Irrigation consumes a significant portion of global water withdrawals, accounting for ~70% of total human water use 178 . This is particularly relevant for food security, as a significant portion of global food production (33-40%) is derived from irrigated cropland 179 . Therefore, understanding the spatial distribution of irrigation is essential for managing water resources and ensuring food security; this is crucial in the MRB in light of the growing impacts of climate change and dams on fishery systems (e.g., Sabo et al. 77 ; Ziv et al. 8 ; Veldkamp et al. 180 ) and the potential need for irrigation expansion 76,181 .
However, there are no specific datasets on irrigated areas and irrigation water use for the MRB. As a result, studies on the MRB have relied on globally available datasets. The latest version of the global map of irrigated areas provided by the Food and Agriculture Organization (FAO) and developed by Siebert et al. 182 is available at 5 arcminutes and has been widely used globally to identify irrigated area and irrigation water use. Moreover, several studies combined various datasets to generate global maps of irrigated areas (e.g., Zabel et al. 183 ; Salmon et al. 184 ; Meier et al. 185 ). Among different global products, FAO 182 based irrigated area and irrigation water use data are commonly used in the Mekong. Therefore, we have selected FAO data for this study. However, datasets for Cambodia are missing in global databases. Thus, we acquired the Cambodian census data and processed them to fill this gap in the global datasets. Similar processing could be performed for other countries; however, census data are not easily accessible for those countries. Furthermore, ongoing agricultural census surveys in other parts of the basin will be extremely valuable for researchers and policy makers.
Ecological data. Nutrients and sediment data. The MRC provides some data on nutrients and sediment, but these datasets are even more sparse than streamflow data and are not freely available. Specifically, the data include Nitrite-Nitrate (NO 3 -N), Total Phosphorus (TP), and Dissolved Oxygen (DO). The MRC Discharge Sediment Monitoring Project (DSMP; Koehnken 186 ) that started in 2009 monitors sediment data at certain locations in the downstream regions of the MRB 187 . Sediment concentration estimates are also available from satellite remote sensing, developed by using empirical or physics-based approaches 187,188 . Here, we present and examine the data from the MRC and identify various other data sources.
Wetland and inundation data. Accurate wetland datasets are crucial for research on climate change, biodiversity preservation, and the implementation of effective land use policies and wetland conservation strategies. Wetland related studies in the Mekong have primarily used global datasets that are based on satellite observations due to the lack of basin-wide in-situ data. For example, Cho and Qi 70 used a multi-sensor approach to overcome limitations in detecting wetland inundations from 2014 to 2021 in Southeast Asia. Several studies have also identified wetlands in the MRB; however, these are limited to the Mekong Delta 189,190 . On a global scale, Sustainable Wetlands Adaptation and Mitigation Program (SWAMP) wetland maps were produced by Gumbricht et al. 191 , which include the wetland categories identified by Ramsar (2013). Furthermore, Tootchi et al. 192 identified global wetlands based on surface water imagery and groundwater constraints. In this study, we provide a comparative evaluation of wetlands based on Gumbricht et al. 191 and Tootchi et al. 192 for the MRB.
Several studies have used satellite products to generate inundation datasets globally [193][194][195] . Here, we use the GIEMS-D15 (Global Inundation Extent from Multi-Satellites - Downscaled to 15 arc-seconds; Fluet-Chouinard et al. 193 ) dataset for inundation maps as the dataset was made available by the authors. Based on topographic indices, the GIEMS-D15 dataset was created by downscaling monthly inundation observations from multiple satellites over a 12-year period from 1993 to 2004 194,195 to a finer grid resolution of 15 arc-second pixels (~500 m at the equator). However, inundation in the MRB, especially in its downstream regions, is strongly related to precipitation seasonality and flow regulation by dams rather than to topography; therefore, other methods, such as normalized difference vegetation index based flood inundation mapping 196 , could be more reliable in the MRB than downscaling the data to higher resolution. Nonetheless, in this study we present the GIEMS-D15 based inundation datasets for the MRB region.
GHG emission data. Studies on GHG emission in the MRB have focused primarily on emissions from rice cultivation in the Mekong Delta 197,198 . Some studies have investigated alternate farming methods to reduce the GHG emissions in the Delta, but these are rather limited [199][200][201] . Moreover, a handful of studies have also estimated GHG emissions from hydropower dams in the MRB (e.g., Räsänen et al. 30 ; Shi et al. 202 ; Wang et al. 36 ). These studies have produced certain GHG datasets, but a complete time series for the entire MRB is lacking. Therefore, for basin-wide studies, global GHG datasets have been used. The Emissions Database for Global Atmospheric Research (EDGAR; Crippa et al. 203 ) v4.3.2 is the primary and most reliable source among gridded GHG datasets. The EDGAR dataset compiles anthropogenic emissions data for CO 2 , CH 4 , and N 2 O based on international statistics and emission factors. Moreover, country-specific annual GHG datasets for CO 2 , CH 4 , and N 2 O are also available from Ritchie et al. 204 (OURWORLDINDATA: https://ourworldindata.org/greenhouse-gas-emissions). Here, we employed the EDGAR dataset, which is available at 0.1° (~10 km) spatial resolution and is comprehensive in terms of covering GHG emissions from local to global scales, to gain insights into GHG emissions; we consider this dataset a reliable alternative in the absence of local datasets 205,206 .
Socio-economic data. Helping advance scientific research and inform science-based management decisions and policymaking for sustainable transboundary basin management requires not only biophysical data (e.g., water, climate, and nutrients), but also socioeconomic data. These data are crucial, for example, to better understand the interactions among climate, water, and societies and ensure food, energy, livelihoods, and water securities under climate change and growing human influence on water systems 1,3,11 .
In this study, we synthesize socio-economic data for the four LMRB countries (i.e., Cambodia, Laos, Thailand, and Vietnam), which are obtained from various sources including government websites, the National Institute of Statistics for Cambodia, and the Lao Statistics Bureau for Laos. We further combined these datasets with those available from public repositories such as the OpenDevelopment Mekong (https://opendevelopmentmekong.net/) and the Socioeconomic Data and Applications Center (SEDAC: https://sedac.ciesin.columbia.edu/data/sets/browse). These datasets cover a range of attributes including population demographics, agriculture, gross domestic product, housing, forestry, fishery, road networks, and internal displacement. However, these data are often limited in terms of spatial and temporal coverage, as detailed in Table 4.
High resolution gridded population and Gross Domestic Product (GDP) data are key to understanding and better predicting exposure and vulnerability of socioeconomic activities to future climate extremes and developing improved adaptation and mitigation strategies 207 . Gridded Population of the World (GPWv4; Doxsey-Whitfield et al. 208 ) datasets have been extensively used in socioeconomic and environmental studies, such as vulnerability mapping, disaster impacts, and health implications of environmental change [209][210][211] . For our study, we used GPWv4 population datasets at 30 arc-second (~1 km) spatial resolution and projected population datasets from SEDAC 212 at 1/8th degree spatial resolution. Furthermore, we utilized gridded GDP data from Kummu et al. 24 at 10-year intervals and gridded GDP projection datasets from Wang and Sun 207 , which are consistent with the shared socioeconomic pathways (SSPs). We further provide a comparison between population and GDP projections for all six Mekong countries.

Data Records
The synthesized datasets are available in the Zenodo repository 213 (https://zenodo.org/record/7803254). The uploaded datasets are optimized considering user convenience and data size reduction. For example, EM-Earth precipitation and temperature, GLEAM ET, GHG emissions, digitized groundwater, population projections, GDP projections, ground observations of soil moisture, and digitized streamflow datasets are provided in text format. The EM-Earth precipitation and temperature, GLEAM ET, and GHG emissions are gridded datasets with the first two rows as locations (longitude and latitude), the initial columns as time information (e.g., year, year-month, year-month-day), and the rest of the columns as data time series. Moreover, the first two columns of the population and GDP projection datasets are gridded locations (longitude and latitude) and the rest of the columns show data for the base year or projected years. Digitized groundwater, ground observations of soil moisture, and digitized streamflow data files contain time information in the initial columns, followed by the corresponding data in the last column. Data on crop yield, which are country-level annual data, are presented with year in the first column and crop types in the first row; dam attributes (first row) are stored in Excel files. The GeoTIFF image format is used for MODIS ET, irrigated area and irrigation water use, LULC, population, GDP, surface water, and wetland datasets. Soil moisture datasets are stored in MATLAB (.mat) files. Each data folder includes a "Readme" file that provides a detailed data description, including the original source, where relevant.
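The layout described above for the gridded text files (two location rows, leading time columns, then one data column per grid cell) can be read with a few lines of NumPy. The sketch below is only an illustration under stated assumptions: whitespace-delimited numeric files, placeholder values filling the time columns of the two location rows, and a user-supplied number of time columns. The function name and the inline sample are hypothetical; the "Readme" file in each data folder remains the authority on the actual format.

```python
import io

import numpy as np


def read_gridded_text(source, n_time_cols):
    """Split a gridded text table into longitudes, latitudes, time columns, and values.

    Assumes row 0 holds longitudes and row 1 holds latitudes (with placeholders in
    the first `n_time_cols` columns), and every later row is one time step.
    """
    table = np.loadtxt(source, ndmin=2)
    lon = table[0, n_time_cols:]
    lat = table[1, n_time_cols:]
    time = table[2:, :n_time_cols].astype(int)
    values = table[2:, n_time_cols:]
    return lon, lat, time, values


# Tiny made-up sample: three grid cells, two monthly time steps (year, month).
sample = io.StringIO(
    "0 0 100.05 100.15 100.25\n"
    "0 0 15.05 15.05 15.05\n"
    "2000 1 120.5 98.2 110.0\n"
    "2000 2 80.1 75.4 79.9\n"
)
lon, lat, time, values = read_gridded_text(sample, n_time_cols=2)
print(values.shape)  # one row per time step, one column per grid cell
```

For a daily file the same call would use `n_time_cols=3` under these assumptions, and a different delimiter could be passed through to `np.loadtxt` if the files are comma-separated.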
The publicly available datasets, such as satellite precipitation and temperature, ET, surface water, satellite soil moisture, LULC, crop yield, crop calendar, wetlands, GHG, and socio-economic datasets, are freely available for download from the original sources. Additionally, in-situ datasets from the MRC, including precipitation, temperature, wind speed, sunshine hours, specific humidity, streamflow, water level, nutrients, and sediment, can be obtained through the formal Procedure for Data and Information Exchange and Sharing (PDIES); these data are open to the LMRB member countries and, to a certain extent, to MRC stakeholders 213 .

Technical Validation
Meteorological data. Among the various hydrometeorological datasets identified in the Methods section, we find the EM-Earth data to be i) relatively inclusive of most climate variables required for analyses and modeling and ii) of reasonable spatial resolution. The dataset also includes multiple ensemble members useful for uncertainty quantification. Thus, we present an analysis of this product, focusing on precipitation and temperature (Fig. 1 and Fig. S1), the two variables of primary interest in many hydrological and ecological studies. Among EM-Earth, APHRODITE, TRMM, IMERG, Princeton (He et al. 100 ), and ERA5 precipitation datasets, EM-Earth data show better results when compared against gauge-based data from the MRC at selected locations, except at Kratie (Fig. S1), indicating high accuracy of the ensemble-mean EM-Earth data. Substantial spatial heterogeneity can be seen in precipitation (Fig. 1a), and temperature exhibits a strong north-south gradient (Fig. 1d). In Laos, Vietnam, and the eastern half of Cambodia, annual precipitation is higher compared to other regions in the MRB (Fig. 1a). A higher mean annual temperature in Cambodia, Thailand, and the Mekong Delta was found compared to other parts of the basin (Fig. 1d).
Additionally, we compare the spatial patterns of precipitation and mean temperature for three different datasets: EM-Earth ensemble mean, ERA5, and APHRODITE ( Fig. 1a-f), revealing interesting patterns and tendencies. The APHRODITE precipitation was comparatively lower than the other two products in Laos, but the three temperature products display similar spatial patterns across the basin. This suggests that while there may be some differences in the precipitation data, temperature data are more consistent across different sources.
Hydrological data. Streamflow and water level. We evaluate the availability and trends in digitized and MRC-based streamflow and water level at various locations in the MRB. In terms of streamflow and water level data, there are more stations with positive trends than with negative trends (Fig. S2a, b). The alternate positive and negative trends in the streamflow and water level data could be due to seasonal shift in water availability in the streams and different time-period considered to evaluate the trend based on data availability (Fig. S2a, b; Table S2, S3). Moreover, we present the seasonal cycle of streamflow and water level at 8 selected stations across the basin. We find that at all the locations streamflow and water level start increasing from May and peak in August or September (except for Changdu which is peaking in July), following the monsoonal rainfall patterns.
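The two diagnostics discussed above, the seasonal cycle and the sign of the long-term trend at a station, can be illustrated with a short sketch. The text does not state which trend estimator was used, so the ordinary least-squares slope on annual means below is an assumption, as are the function names and the synthetic monthly series (built to peak in September with a small upward drift).

```python
import numpy as np


def monthly_climatology(months, flow):
    """Mean flow for each calendar month (January to December)."""
    months = np.asarray(months)
    flow = np.asarray(flow, dtype=float)
    return np.array([np.nanmean(flow[months == m]) for m in range(1, 13)])


def annual_trend(years, flow):
    """Least-squares slope of annual mean flow against year (flow units per year)."""
    years = np.asarray(years)
    flow = np.asarray(flow, dtype=float)
    unique_years = np.unique(years)
    annual = np.array([np.nanmean(flow[years == y]) for y in unique_years])
    slope, _intercept = np.polyfit(unique_years, annual, 1)
    return slope


# Synthetic 10-year monthly record: monsoon-like cycle plus a drift of 5 units per year.
years = np.repeat(np.arange(2001, 2011), 12)
months = np.tile(np.arange(1, 13), 10)
flow = 1000 + 800 * np.cos(2 * np.pi * (months - 9) / 12) + 5.0 * (years - 2001)

climatology = monthly_climatology(months, flow)
peak_month = int(np.argmax(climatology)) + 1
slope = annual_trend(years, flow)
print(peak_month, round(slope, 3))
```

A positive slope corresponds to a station counted as having a positive trend. Because the stations cover different periods, slopes computed this way are only comparable when evaluated over a common window.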
ET. We compare two ET datasets based on GLEAMv3.6b and MOD16A2GF for four seasons, finding that both datasets show similar spatial patterns across the basin (Fig. 2). Upon analyzing the seasonal pattern in both datasets, we find that the summer season has the highest ET, which is consistent with the seasonal precipitation patterns (Fig. 2a-d). Additionally, we observe that the spatial patterns of ET vary depending on the location within the MRB. However, with the exception of the spring season, the seasonal and annual MODIS ET is comparatively higher than GLEAM ET (Fig. 2). Finally, our investigation reveals that the mean annual ET for both datasets for the entire MRB exhibits a similar increasing trend over time (Fig. 2i). Increased ET in the basin may change the percentage of precipitation that becomes surface water runoff or subsurface recharge, which may affect groundwater levels, groundwater-surface water interactions, and soil moisture 214,215 .
Surface water. Fig. S3a shows the long-term occurrence of surface water in the MRB based on the data from Ji et al. 128 . We highlight two regions of particular interest: one mainly featuring multiple reservoirs, and the other featuring the TSL and Mekong Delta (Fig. S3). For these two regions we compare the surface water based on JRC and Ji et al. 128 (Fig. S3b-e), finding that the data from Ji et al. 128 show a smaller extent and also lower occurrence, especially in the Mekong Delta, compared to JRC data. Such surface water datasets are crucial for a wide range of studies in the MRB, including for model evaluation and studies on ecological, agricultural, fisheries, and livelihood changes, especially in relation to upstream dam construction. Many recent studies have used these datasets to examine the changing inundation patterns around TSL and Mekong Delta due to climate variability and dam construction 9,29,47 . However, these datasets, which are mostly satellite based, provide limited information on the changes in surface water such as long-term occurrence or changes in permanent water bodies. Therefore, there are opportunities to develop more accurate and reliable surface water datasets, for example by using information from future satellite missions or improved modeling approaches.
Soil moisture. After reviewing numerous soil moisture products available for the MRB region (discussed in the Methods section), we chose the soil moisture data from Fang et al. 133 for this study because of its good accuracy and relatively high resolution. We analyze the spatial variation of mean annual soil moisture across the MRB and compare the downscaled soil moisture data with ground data at five locations in Thailand. The mean annual surface (up to 5 cm) soil moisture content in Laos and Vietnam is higher than in other parts of the MRB (Fig. 3a). The Mekong Delta in Vietnam and the floodplains in Cambodia show higher soil moisture content. Similarly, the southern part of the Chinese portion of the MRB, northern Laos, and the adjoining portions of Thailand show higher soil moisture. Soil moisture levels are lower in Thailand and in some areas of Cambodia that are primarily agricultural (Fig. 3a). A comparison with observed soil moisture at five locations in Thailand suggests that SMAP captures soil moisture content reasonably well (Fig. 3b).
We also analyze the seasonal patterns in the soil moisture data, which reveal that soil moisture is generally higher in summer and autumn than in spring and winter (Fig. 3c-f). This pattern is consistent with the typical rainy season in the MRB region, which occurs during the summer and autumn months and results in increased soil moisture levels. We note, however, that this pattern could vary among regions with different climate patterns that govern seasonal rainfall.
Groundwater. Groundwater anomalies can be estimated by subtracting the modeled surface water anomalies (e.g., obtained from the Global Land Data Assimilation System (GLDAS)) from terrestrial water storage (TWS) anomalies derived from GRACE satellite observations 216,217. However, the spatial resolution of GRACE data is low, and surface water from GLDAS contains high uncertainty, for example because human interventions are not represented. Moreover, observed groundwater datasets for the MRB are not publicly available. Therefore, we present groundwater data digitized from a series of published studies (Fig. 4). The data consist of temporal measurements of groundwater levels at various locations within the MRB, including daily, monthly, yearly, and seasonal cycles. The highest density of data points is found in the Mekong Delta region, encompassing parts of Vietnam and Cambodia. A general examination of the digitized data reveals declining groundwater levels within the MRB, with the most pronounced decreases occurring in the Mekong Delta (Fig. 4). This decreasing trend in groundwater is likely influenced by the high level of groundwater pumping for agricultural purposes in both Vietnam and Cambodia 46,137,146,218. In addition, the extraction of groundwater for agricultural and domestic use has been linked to subsidence in the Mekong Delta 11,46. Given these findings, improved management and conservation efforts will be necessary to ensure the sustainable use of groundwater resources in the MRB, particularly in the Mekong Delta region.
Fig. 6 Spatial distribution of (a) area equipped for irrigation expressed as a percentage of total area (grid resolution ~10 km) (AEI-PTA), (b) area equipped for irrigation expressed in hectares per cell (AEI-HPC), (c) area actually irrigated expressed as a percentage of area equipped for irrigation (AAI-PAI), (d) area irrigated with groundwater expressed as a percentage of total area equipped for irrigation (AIG-PTI), (e) area irrigated with surface water expressed as a percentage of total area equipped for irrigation (AIS-PTI), (f) area irrigated with water from non-conventional sources expressed as a percentage of total area equipped for irrigation (AIN-PTI).
Dam data. Figure 5 depicts the synthesized and corrected dam datasets (see the Methods section) for the MRB. More detailed information is compiled for the dams selected from the database for hydrological modeling, including dam height, reservoir storage capacity, and reservoir purpose, among others (Table S4). Here we present selected dam attributes such as dam status, installed capacity, dam height, and reservoir storage capacity (Fig. 5). In the past decade, ~100 dams have been constructed in the MRB 25, with several more currently planned or under construction, particularly in China, Laos, and Cambodia. The construction of large dams such as Ru Mei, Guxue, Gushui, and Huangdeng in the UMRB (Fig. 5a) has sparked environmental concerns such as the decline in flood-season river flow and annual sediment flux, and water quality deterioration in reservoirs within China 219.
Additionally, the construction of large dams such as Xayaburi, Nuozhadu, and Don Sahong has led to the trapping of sediment flow and disruption of fisheries, raising significant ecological concerns 1,77,220 . Therefore, the dam database is expected to be useful in hydrological, ecological, and socio-economic modelling and consequently future planning and management.
Land use and crop data. The land use data for the MRB obtained from ESA-CCI were analyzed for 1992 to 2020. Of the eleven land use classes, cropland, tree cover, mosaic tree and shrub, shrubland, and grassland are dominant (Fig. S4). We select two regions, one in the upper and one in the lower basin, for a more in-depth examination (Fig. S4). We find that the LMRB is experiencing a significant increase in crop coverage, whereas the UMRB shows a substantial increase in tree cover (Fig. S4). In the upper region (region b), there is a slight increase in cropland but a significant increase in tree cover. Tree cover increased primarily from 1996 to 2000, compensating for the loss in shrubland. In the lower region (region c), by contrast, cropland increased substantially with a corresponding decline in tree cover. Urban areas in region c also increased considerably compared to region b. Overall, cropland area in the basin rose steadily until 2012 but has declined since then (Fig. S4c).
To gain further insight into crop dynamics in the LMRB, we analyze data on crop yields for rice, maize, bananas, and sugarcane (the major crops grown in the region) for four LMRB countries (Fig. S5). Our analysis reveals that rice yield has been increasing in all countries, particularly after 1990. Vietnam exhibits the highest rice yield among the four countries. Similarly, maize yield has been increasing across all countries, with Laos exhibiting the highest yield. While banana and sugarcane yields have been decreasing in Cambodia, they have been increasing in the remaining three countries. These data, and the patterns therein, could be useful for studies on water and food security issues in the MRB; however, they are available only at the country level and hence cannot be used directly for basin-scale analyses or modeling. Nevertheless, the datasets could be used to derive grid-based products through combination with other datasets such as those on cropland areas (e.g., Burbano et al. 76).
Irrigated area and irrigation water use. The areas equipped for irrigation in the MRB mainly range from 0 to 20% of the total grid cell area at 0.083333° (~10 km) spatial resolution (Fig. 6a). The Mekong Delta, the floodplains in Cambodia, Thailand, the southern part of the UMRB, and some portions of Laos are the main areas that are intensively irrigated (Fig. 6b). Results suggest that the Vietnam and northern Laos portions are more heavily irrigated than Cambodia, Thailand, and the southern portion of Laos (Fig. 6c). Except for some portions of Thailand and the Mekong Delta, which are irrigated with groundwater, the rest of the basin is irrigated mostly with surface water (Fig. 6d, e). Beyond these conventional sources, the use of non-conventional sources for irrigation is extremely limited (Fig. 6f). As the demand for agricultural products from the LMRB is projected to rise by 20-50% in the coming 30 years due to the growing global population 221, there is a growing risk of food and water insecurity in the basin. To address this issue, it is important to better understand where irrigation is currently happening, what the implications for water and food systems are, and how future irrigation expansion could affect sustainable water use.
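As a rough illustration of how a percentage of cell area (as in Fig. 6a) relates to hectares per cell (as in Fig. 6b), the sketch below converts one into the other on a 0.083333° (5-arcminute) grid. It assumes a spherical Earth with a mean radius of 6371 km, and the function names are hypothetical; it is not the procedure used to produce the source irrigation maps.

```python
import math

# Assumption: spherical Earth with mean radius 6371 km.
EARTH_RADIUS_M = 6_371_000.0


def cell_area_ha(lat_center_deg, res_deg=1.0 / 12.0):
    """Area (hectares) of a res_deg x res_deg cell centred at lat_center_deg."""
    half = res_deg / 2.0
    lat_south = math.radians(lat_center_deg - half)
    lat_north = math.radians(lat_center_deg + half)
    dlon = math.radians(res_deg)
    # Exact area of a latitude-longitude cell on a sphere.
    area_m2 = EARTH_RADIUS_M ** 2 * dlon * (math.sin(lat_north) - math.sin(lat_south))
    return area_m2 / 10_000.0  # 1 ha = 10,000 m^2


def aei_percent_to_hectares(aei_pct, lat_center_deg):
    """Convert area equipped for irrigation from % of cell area to ha per cell."""
    return aei_pct / 100.0 * cell_area_ha(lat_center_deg)


if __name__ == "__main__":
    # Hypothetical cell: 20% equipped for irrigation at 10 degrees N.
    print(round(aei_percent_to_hectares(20.0, 10.0), 1))
```

On this grid a cell near the equator covers on the order of 8,500 ha, and cell area shrinks toward the northern part of the basin, so the same percentage corresponds to fewer hectares at higher latitudes.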
To fill irrigation data gaps for Cambodia, we obtained and examined data from the Cambodian census database, focusing on the spatial patterns of irrigation practices (Fig. 7). Results indicate that the highest agricultural land utilization in Cambodia is located near the TSL and in the floodplain zones (Fig. 7a). Moreover, a higher density of irrigation infrastructure is observed in the floodplain zones than in other regions, as also found by Park et al. 146 (Fig. 7b). The census data also provide insights into the irrigation systems owned and operated by the government. Government-owned irrigation systems are less prevalent than other irrigation sources such as wells, canals, and open water (Fig. 7c-e). Other than well, canal, and open water irrigation, the remaining irrigation sources are insignificant in the MRB (Fig. 7g). The total irrigated area data appear incomplete: the responses are difficult to reconcile with propHHIrr, and many values are NaN (Fig. 7h). Similar census data on irrigation for the other LMRB countries would help in better understanding and modeling the changes in irrigation water use; however, these datasets are currently inaccessible.
Ecological data
Nutrients and sediment data. We obtained nutrient datasets (DO, NO3-N, and TP) from the MRC, as described in the Methods section; the station locations are shown in Fig. S6. We selected only locations where DO, NO3-N, and TP data were available for 1996-2021. Results indicate that, except in Cambodia (near the TSL), DO exhibits a declining trend (Fig. S6a). Further, in the TSL region and the Mekong Delta, the NO3-N concentration has been increasing over time (Fig. S6b). However, TP is increasing across the entire basin except in the Mekong Delta (Fig. S6c).
Contrary to the common belief, the construction of multiple dams in the upstream MRB has increased nutrient concentration downstream 52,222, especially in Cambodia. Wang et al. 223 also showed an increasing trend in total suspended solids in Cambodia between 2000 and 2018. However, DO and TP show a negative trend in the Mekong Delta. Moreover, DO and TP appear to have been discharged to inland water bodies (e.g., lakes), whereas their delivery to the Mekong Delta is declining (Fig. S6).
Similarly, sediment concentration datasets were obtained from the MRC for 18 locations within the MRB (Table S5). However, the datasets are neither continuous nor complete for all locations; hence, a statistical analysis was not possible. Therefore, we conducted a visual inspection of the data at 10 locations, finding a decline in sediment concentration, particularly at locations within the mainstream Mekong (Fig. S7). Only the Mae Suai dam site and Rasi Salai stations, which are not on the mainstream Mekong, show an increase in sediment concentration; the rest of the stations show a decline (Fig. S7). The reduction in sediment load implies downstream impacts including coastal erosion, reduced nutrient supply for aquatic species and agriculture, and land subsidence 22,224.
Wetland and inundation data. We provide a detailed comparison between the two selected wetland products: one based on the maps produced by Gumbricht et al. 191 and the other from Tootchi et al. 192 (Fig. 8a,b). Gumbricht et al. 191 classified the wetlands mainly as open water, mangrove, swamps, fens, riverine and lacustrine, floodplains, and marshes based on geomorphology, moisture condition, and vegetation and soil condition. These wetlands are located mostly in the LMRB, with the Mekong Delta housing many swamps, mangroves, and floodouts (Fig. 8a). The floodplains of Cambodia and the areas around the lake, on the other hand, include open water, marshes, and meadows (Fig. 8a). In contrast, Tootchi et al. 192 classified the wetlands into two types: (i) regularly flooded wetlands (RFW) and (ii) groundwater-driven wetlands (GDW). RFWs were produced by combining three inundation datasets (ESA-CCI, GIEMS-D15, and JRC surface water), whereas GDWs were derived from the groundwater simulations of Fan et al. 225, considering only pixels with a water table depth of less than 20 cm. Finally, Tootchi et al. 192 proposed composite wetlands (CW), which combine RFWs and GDWs (Fig. 8b). However, because of the difficulty in downscaling the flooding to higher resolution in the MRB, maps based on Tootchi et al. 192 contain higher uncertainty. Therefore, the SWAMP data developed by Gumbricht et al. 191 could be considered the better product for wetland identification in the MRB.
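The composite wetland logic described above can be sketched with boolean masks. This is a minimal illustration, assuming that the combination of inundation datasets is a pixel-wise union and that all inputs are already on a common grid; the function and argument names are hypothetical and do not reproduce the processing chain of Tootchi et al. 192.

```python
import numpy as np


def composite_wetlands(inundation_masks, water_table_depth_m, wtd_threshold_m=0.2):
    """Return (RFW, GDW, CW) boolean masks on a common grid.

    inundation_masks: list of 2-D arrays, non-zero where a dataset flags inundation.
    water_table_depth_m: 2-D array of water table depth in metres (NaN = no data).
    wtd_threshold_m: depth threshold for GDW (20 cm, as stated in the text).
    """
    # Assumption: regularly flooded wetlands are the union of the input masks.
    rfw = np.logical_or.reduce([np.asarray(m, dtype=bool) for m in inundation_masks])
    # Groundwater-driven wetlands: shallow water table; NaN compares as False.
    gdw = np.asarray(water_table_depth_m, dtype=float) < wtd_threshold_m
    # Composite wetlands: a pixel is a wetland if it is RFW or GDW.
    cw = rfw | gdw
    return rfw, gdw, cw


if __name__ == "__main__":
    # Tiny synthetic 2 x 2 grids (hypothetical values).
    esa = np.array([[1, 0], [0, 0]])
    giems = np.array([[0, 1], [0, 0]])
    jrc = np.zeros((2, 2))
    wtd = np.array([[5.0, 5.0], [0.1, np.nan]])
    print(composite_wetlands([esa, giems, jrc], wtd)[2])
```

A stricter rule, such as requiring agreement among inundation datasets, would shrink the RFW mask; the union is used here only because it is the simplest reading of "combination".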
Similarly, the GIEMS-D15 dataset (Fluet-Chouinard et al. 193) was used to identify the annual inundation areas in the MRB (Fig. 8c). We find that portions of the Mekong Delta and the TSL basin, and the main channels of the Chi and Mun Rivers, tributaries of the Ngun River, were inundated at least once every year (Fig. 8c). However, considering the annual mean of the maximum flood extent, the entire Mekong Delta, a larger portion of the TSL basin, the basins of the Chi and Mun Rivers, and areas around Vientiane station were all inundated. Moreover, the long-term maximum inundation is nearly the same as the mean annual maximum inundation in the MRB. Furthermore, the large uncertainties in the GIEMS-D15 flood inundation data, propagated from downscaling using DEM data, result in overestimation of the inundation area in the MRB (Fig. 8c). Therefore, a basin-wide study at a fine temporal and spatial resolution is essential for the management and conservation of biodiversity and other ecosystem services associated with freshwater.
GHG emission data. The EDGAR-based GHG emission datasets cover a substantially long period, enabling a detailed analysis of the spatial patterns of GHG emissions in the MRB. Here, we focus on the trends in GHG emissions over time to understand how these emissions have changed (Fig. 9a). Results indicate a rising trend in GHG emissions throughout the basin. The rate of increase in mean annual GHG emissions is higher in Thailand, the western part of Cambodia (around the TSL), and the Mekong Delta than in other regions of the MRB (Fig. 9a). These high-emission regions are highly populated, intensive agricultural areas. Moreover, annual mean GHG emissions over the entire MRB increased alarmingly from 1970 to 2018 (Fig. 9b).
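The trend estimator behind Fig. 9a is not restated here. As one plausible sketch, the per-pixel rate of change can be computed as an ordinary least-squares slope through a stack of annual emission grids. The function name and the array layout (time, rows, columns) are assumptions for illustration, and the example values are synthetic.

```python
import numpy as np


def linear_trend_per_pixel(years, stack):
    """Least-squares slope (units per year) for each pixel.

    years: 1-D sequence of length nt.
    stack: array of shape (nt, ny, nx) holding one grid per year.
    Assumption: ordinary least squares; the source does not name its estimator.
    """
    years = np.asarray(years, dtype=float)
    stack = np.asarray(stack, dtype=float)
    nt, ny, nx = stack.shape
    # Centre the years for numerical stability; the slope is unchanged.
    x = years - years.mean()
    # polyfit accepts a 2-D y and fits each column independently.
    slope, _intercept = np.polyfit(x, stack.reshape(nt, ny * nx), deg=1)
    return slope.reshape(ny, nx)


if __name__ == "__main__":
    yrs = np.arange(1970, 2019)
    # Synthetic 1 x 2 grid: one pixel rising by 2 units/yr, one constant.
    demo = yrs[:, None, None] * np.array([[2.0, 0.0]]) + 1.0
    print(linear_trend_per_pixel(yrs, demo))
```

A rank-based estimator could be substituted where outliers or non-normal residuals are a concern; the least-squares form is shown only for brevity.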
Socio-economic data. We compare the projected increase in population and GDP under shared socioeconomic pathways (SSPs) (Fig. 10). Population (2010-2100) and GDP (2030-2100) projections are shown as percentages of the base-year values (2000 and 2005, respectively). Projections for the regions of Cambodia, China, Laos, Myanmar, Thailand, and Vietnam that fall within the MRB were calculated for five SSP scenarios: SSP1 (Sustainability), SSP2 (Middle of the road), SSP3 (Regional rivalry), SSP4 (Inequality), and SSP5 (Fossil-fueled development). In almost all countries, all scenarios show a decrease in population by the end of the century, except SSP3, which shows increasing population in most countries within the MRB. SSP4 and SSP5 show the strongest decreasing trends among the scenarios. GDP, in contrast, increases in each country under all five SSP scenarios. SSP5, which is projected to have one of the lowest populations, shows the highest GDP growth in all countries within the MRB. Therefore, the population and GDP projections are inversely related in each country of the MRB (Fig. 10).
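The normalization used for these projections is simple arithmetic: each projected value is divided by the base-year value and multiplied by 100. The sketch below illustrates it with made-up numbers; the function name and the example series are hypothetical and are not taken from the SSP database.

```python
def percent_of_base(series, base_year):
    """Express each value in `series` (year -> value) as a percentage of the base year."""
    base = series[base_year]
    if base == 0:
        raise ValueError("base-year value must be non-zero")
    return {year: 100.0 * value / base for year, value in series.items()}


if __name__ == "__main__":
    # Hypothetical population series (millions); not real SSP output.
    population = {2000: 50.0, 2050: 60.0, 2100: 45.0}
    print(percent_of_base(population, base_year=2000))
```

Values above 100 indicate growth relative to the base year and values below 100 indicate decline, which is how the end-of-century population decrease is read from such curves.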

Usage Notes
The datasets synthesized in this study could form the basis for a range of hydrological, agricultural, ecological, and socioeconomic studies in the MRB. For example, the meteorological datasets from EM-Earth provide a probabilistic approach to meet the diverse requirements of hydrometeorological and ecological applications. The observed climate and streamflow datasets can be used in hydrological models to constrain the streamflow data at sparse locations in the basin. Further, the groundwater datasets we digitized from the published literature partly fill the complete vacuum in groundwater data for the MRB. Such data are crucial for groundwater modeling in the MRB, which is indispensable for better understanding the rapidly evolving groundwater dynamics across the basin. Indeed, groundwater in the MRB remains relatively poorly studied and needs increased attention. The nutrient datasets at various locations in the MRB could be used to improve the understanding of changes in water quality as well as to constrain and validate model simulations of riverine nutrient budgets, another research direction that has received very little attention, owing primarily to critical data gaps. Spatial and temporal changes in land use and land cover are directly linked to changes in hydrological, agricultural, and ecological systems across the basin. Thus, the land use data could be of use for a range of hydrological, agricultural, and ecological studies. Moreover, population projections can be used in determining the exposure and vulnerability to future hazards. The synthesized gridded GDP projections will help in identifying the vulnerability, exposure, and resilience of socioeconomic activities under future climate extremes.
In summary, the datasets synthesized here are expected to fill the widely acknowledged and long-debated data gap for the MRB. This gap has hindered socio-hydrological studies, including modeling and analysis, aimed at improving the understanding of rapidly emerging hydrological, agricultural, and ecological systems within the basin and at providing improved future projections for transboundary water management and sustainability.