Comparative Ground Validation of IMERG and TMPA at Variable Spatiotemporal Scales in the Tropical Andes

An initial ground validation of the Integrated Multisatellite Retrievals for GPM (IMERG) Day-1 product from March 2014 to August 2015 is presented for the tropical Andes. IMERG was evaluated along with the Tropical Rainfall Measuring Mission (TRMM) Multisatellite Precipitation Analysis (TMPA) against 302 quality-controlled rain gauges across Ecuador and Peru. Detection, quantitative estimation statistics, and probability distribution functions are calculated at different spatial (0.1°, 0.25°) and temporal (1 h, 3 h, daily) scales. Precipitation products are analyzed for hydrometeorologically distinct subregions. Results show that IMERG has better detection and quantitative rainfall intensity estimation ability than TMPA, particularly in the high Andes. Despite slightly weaker agreement of mean rainfall fields, IMERG shows better characterization of gauge observations when separating rainfall detection and rainfall rate estimation. At corresponding space–time scales, IMERG shows better estimation of gauge rainfall probability distributions than TMPA. However, IMERG shows no improvement in either rainfall detection or rainfall rate estimation along the dry Peruvian coastline, where major random and systematic errors persist. Further research is required to identify which rainfall intensities are missed or falsely detected and how errors can be attributed to specific satellite sensor retrievals. The satellite–gauge difference was associated with the point-area difference in spatial support between gauges and satellite precipitation products, particularly in areas with low and irregular gauge network coverage. Future satellite–gauge evaluations need to identify such locations and investigate more closely interpixel point-area differences before attributing uncertainties to satellite products.


Introduction
The lack of reliable observations of hydrological variables in the tropics leads to a poor understanding of the hydrological cycle in those regions (Wohl et al. 2012). Especially in mountain regions, such as the tropical Andes, the complex topography results in highly variable spatiotemporal precipitation patterns that are not fully captured by the existing gauge monitoring networks (Ochoa-Tocachi et al. 2016; Rollenbeck and Bendix 2011; Buytaert et al. 2006). In the last decades satellite-based precipitation products (SPPs) have become an alternative source of precipitation estimation with widespread applications such as (distributed) hydrological modeling (Falck et al. 2015; Zulkafli et al. 2014; Jiang et al. 2012; Li et al. 2009), geomorphology and landscape evolution (Nesbitt and Anders 2009; Bookhagen and Strecker 2008), streamflow forecasting (Nikolopoulos et al. 2013; Li et al. 2009), and early warning systems (Tian et al. 2010), as well as investigations into atmospheric processes and storm structures (Boers et al. 2015; Mohr et al. 2014; Boers et al. 2013; Rasmussen et al. 2013; Demaria et al. 2011).
Assessments of SPPs against rain gauge networks in the tropical Andes of Ecuador and Peru have shown a general dependence of SPP performance on rainfall intensity (Mantas et al. 2015), in addition to underestimation of the amplitude of the seasonal cycle as well as underestimation of extreme rainfall intensities (Derin et al. 2016). SPPs have also been shown to systematically overestimate low rainfall intensities (under 5 mm h⁻¹), which occur frequently above 2000 m MSL (Derin et al. 2016) and locally represent an important contribution to the total rainfall volume (Padrón et al. 2015). Furthermore, there are regional differences in SPP performance depending on the principal sensor technology and the local rainfall properties as a result of interaction of the synoptic-scale climate processes with the complex topography (Derin et al. 2016; Satgé et al. 2016; Dinku et al. 2010). The Precipitation Estimation from Remotely Sensed Information Using Artificial Neural Networks (PERSIANN; Hsu et al. 1999) product showed large biases and low correlation with rain gauges (Ward et al. 2011; Derin et al. 2016) and was particularly deficient in regions with a significant contribution by deep convective systems to the total rainfall volume (Derin et al. 2016). Comparative studies have shown that the Tropical Rainfall Measuring Mission (TRMM) Multisatellite Precipitation Analysis (TMPA) outperformed other SPPs, including PERSIANN, the Climate Prediction Center (CPC) morphing technique (CMORPH; Joyce et al. 2004), and the Global Satellite Mapping of Precipitation (GSMaP; Kubota et al. 2007), among others, in terms of correlation, rainfall intensity distribution, and bias (Satgé et al. 2016; Derin et al. 2016).
The Global Precipitation Measurement (GPM) Core Observatory, launched on 28 February 2014, has been designed to address critical limitations of TRMM and to further improve the scientific contribution of its predecessor in understanding the global water and energy cycle (Hou et al. 2014). The TRMM Precipitation Radar (PR) detection limit of 17 dBZ (~0.7 mm h⁻¹) and the resulting large fraction of missed rainfall was addressed using a dual-frequency radar (DPR). The PR Ku-band frequency of 13.6 GHz is supplemented by a Ka-band frequency at 35.5 GHz for better identification of different phases of precipitation particles and detection of light rainfall to a resolution of 0.2 mm h⁻¹. The conical-scanning GPM Microwave Imager (GMI) has more frequency channels (10-183 GHz) than the TRMM Microwave Imager (TMI) for light-intensity rainfall and snow detection (Hou et al. 2014). Furthermore, satellite revisit time has been reduced from 11-12 h (TRMM satellite) to less than 3 h (GPM), and spatial coverage of the Core Observatory has increased from latitudes 35°S-35°N to 65°S-65°N (Hou et al. 2014). In addition to the Core Observatory, the increasing number of passive microwave sensors within the GPM constellation allows for more frequent sampling and cross calibration of sensors, thus increasing the spatial and temporal resolution of the gridded Integrated Multisatellite Retrievals for GPM (IMERG) precipitation product to 0.1° and 30 min, compared with 0.25° and 3 h in TMPA (Huffman et al. 2015a).
Initial comparative evaluations of IMERG Day-1 and TMPA against rainfall gauges under different climatic and topographic conditions have confirmed the expected improvements of GPM. Prakash et al. (2016) have demonstrated higher estimation accuracy of IMERG over TMPA for heavy monsoon-type rainfall at a daily scale, although agreement with gauges was found to be lowest in orography-dominated regions (Prakash et al. 2017). Across China and the Tibetan Plateau, IMERG yields better statistical and hydrological performance than TMPA at (sub)daily scales and exhibits better detection of rainfall intensities than its predecessor, although overestimation of rainfall in dry regions persists (Ma et al. 2016; Tang et al. 2016; Li et al. 2016; Chen and Li 2016). Sharifi et al. (2016) found IMERG to perform better statistically than TMPA across a range of topographic and climatic conditions, with rainfall detection being improved in areas with orographic precipitation. Last, IMERG was found to overestimate the frequency of drizzle and underestimate the frequency of heavy rainfall compared to radar quantitative precipitation estimation (QPE) over the United States (Tan et al. 2016).
Early experiences indicate the potential improvements of IMERG compared to TMPA; however, performances vary substantially across different topographic and climatic conditions, especially in orography-dominated regions. Furthermore, there is to date a lack of IMERG evaluations in tropical mountain regions that are characterized by complex and variable precipitation patterns with strong orographic effects. Therefore, the objective of this study is to evaluate the performance of IMERG in comparison with TMPA against rain gauges during the first 17 months of GPM observations (from March 2014 to August 2015) to characterize the impact of rainfall regimes, as well as climatic and topographic conditions on IMERG estimates at different spatiotemporal scales. Following a description of the hydrometeorology of the study area, the satellite and rain gauge datasets are outlined (section 2). Section 3 introduces the evaluation methodology with a focus on rainfall detection, QPE, rainfall probability distribution, and spatial correlation. Results are presented for climatically homogeneous regions in section 4 and discussed in section 5. Last, critical findings are summarized in section 6.

a. Study area: Precipitation patterns of the tropical Andes
The study area (Fig. 1) extends from 2°N to 18.5°S and from 68.5° to 82°W (approximately 1 500 000 km²), covering a climatically diverse region, from northern Ecuador to the central Andean plateau (Altiplano), that is dominated by the tropical Andes, which results in extreme east-west precipitation gradients. Spatiotemporal precipitation patterns are controlled by the biannual migration of the intertropical convergence zone (ITCZ), El Niño-Southern Oscillation (ENSO), and the cold von Humboldt current in the Pacific Ocean, as well as the Andes mountains and the Amazon basin (Boers et al. 2013; Lavado-Casimiro et al. 2012; Vuille et al. 2000). Easterly trade winds resulting from the southerly position of the ITCZ during the monsoon season transport moist air from the tropical Atlantic over the Amazon basin (Boers et al. 2013) and are blocked by the topographic barrier of the tropical Andes (Romatschke and Houze 2010). The deflection of air moisture to the southeast gives rise to the South American low-level jet (SALLJ) that transports air moisture along the eastern Andes into the La Plata basin (Boers et al. 2013). The strong topographic gradients and easterly trade winds along the eastern flanks of the Andes also result in pronounced orographic effects (Espinoza et al. 2015; Espinoza Villar et al. 2009; Bookhagen and Strecker 2008). These, in turn, result in deep convection (Romatschke and Houze 2010) and thereby highly intermittent spatiotemporal precipitation patterns with steep precipitation gradients of up to 190 mm km⁻¹ (Espinoza et al. 2015).
Given its hydrometeorological complexity, the region was divided into six subregions using the classification of Zulkafli et al. (2014), which identifies areas with distinct precipitation regimes based on topography and climate (Fig. 2). In the current study, the Amazon sub-Andes have been defined as 500-1500 m MSL in order to ensure an adequate number of gauges (minimum 10) in each subregion. As shown in Table 1, the subregions differ in their climatic controls, resulting in a range of distinct precipitation regimes. Over the period 1998-2014, the mean annual precipitation varies considerably from the Pacific coast north of 4.5°S (PCN; approximately 1450 mm yr⁻¹), over the Pacific coast south of 4.5°S (PCS; 300 mm yr⁻¹), the western Andean slopes (AW; 575 mm yr⁻¹), the eastern Andean slopes (AE; 1150 mm yr⁻¹), and the sub-Andes of the upper Amazon basin (ASA; 3500 mm yr⁻¹), to the Amazon lowlands (AL; 2375 mm yr⁻¹) (Manz et al. 2016).

b. Rain gauge data
Subhourly precipitation records were obtained from 302 rainfall gauges (Fig. 1) (WMO 2014). While minor differences between the providers in terms of preprocessing and temporal interpolation may occur, these are thought to be negligible for the purposes of this evaluation study as the original datasets were aggregated to hourly rainfall accumulations for the assessment period from 1 April 2014 to 31 August 2015. Hourly rain gauge data were quality-controlled using the protocol defined by Shen et al. (2010), consisting of a check for unsupported extremes as well as internal and spatial consistency checks. This resulted in 0.64% of hourly measurements being removed as a result of the extremes check and a further 0.01% as a result of the consistency checks. Finally, 3-hourly and daily average rainfall rates (mm h⁻¹) were computed in order to evaluate the satellite precipitation data at different temporal scales.

c. Satellite data: TMPA V7
TMPA version 7 (V7), also known as TRMM 3B42 based on its algorithm names (hereafter TMPA), is a precipitation dataset derived from multiple microwave (MW) and infrared (IR) sensors placed on low-Earth-orbit (LEO) satellites. Observations from MW and IR sensors are converted to precipitation estimates, intercalibrated, and combined to produce real-time (RT) estimates; finally, the resulting estimates are adjusted with rain gauge data to generate the TMPA ''Research'' version (Huffman et al. 2007, 2010). The TRMM satellite started its terminal phase in October 2014 with both the TRMM PR and the TMI having shut down on 8 April 2015; however, the demise of the TRMM satellite itself does not substantially affect the production of the TMPA Research version over land (Huffman et al. 2015a). TMPA will be produced until approximately mid-2017 (Huffman et al. 2015a). Hence, TMPA was considered a suitable benchmark for comparative evaluation of IMERG. TMPA at its native resolution of 0.25°/3 h was resampled to closed 3-h periods (e.g., 0000-0300 UTC) for comparison to IMERG.

d. Satellite data: IMERG Day-1
The IMERG product provides high-resolution precipitation estimates by combining various passive microwave (PMW) and IR sensor measurements. This process is described in detail in Huffman et al. (2015b) and is briefly summarized as follows.
Analogous to the TMPA algorithm, the GMI is calibrated to the DPR and the resulting combined instrument (GCI) is used as a calibration standard for other PMW sensors in the GPM constellation. Precipitation estimates are derived from the PMW sensors using the 2014 Goddard Profiling Algorithm (GPROF2014). GPROF2014 relies on external radar and PMW observations for calibrating PMW measurements and is set to be replaced by GPROF2015 in due course, which uses GCI instead. In contrast to TRMM, both DPR and GCI are available in real time, allowing for the same calibrating sensors across all IMERG runs.
All calibrated PMW estimates within the GPM constellation are gridded to 0.1°/30 min, prioritizing conical-scan radiometers over cross-track scanners. Geosynchronous IR (geo-IR) measurements are converted to precipitation estimates using the PERSIANN-Cloud Classification System (CCS) recalibration scheme. Herein regional cloud patch groups are defined and precipitation is assigned to each of these based on an LEO PMW precipitation training set. Next, PMW and IR estimates are combined to create half-hour precipitation estimates using the CMORPH Kalman filter (KF) Lagrangian time interpolation scheme. In this approach, PMW estimates of instantaneous precipitation are propagated from observation to analysis time using cloud motion vectors, which are derived by correlation of spatially lagged consecutive geo-IR images.
Finally, a monthly satellite-gauge combination is performed akin to the approach adopted in TMPA, where multisatellite and gauge fields are combined using inverse error variance weighting. These bias-adjusted estimates are redistributed at the half-hourly scale for the ''Final'' product, which is distributed 2-4 months after measurement (Huffman et al. 2015a). In this evaluation study, the IMERG Day-1 Final run dataset (hereafter referred to as IMERG) is aggregated for the assessment period from its native resolution 0.1°/30 min to 0.25°/3 h and 0.25°/day for evaluation against TMPA.
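As a generic illustration of inverse error variance weighting, two independent estimates of the same monthly precipitation can be combined as sketched below. This is not the operational IMERG or TMPA formulation (see Huffman et al. 2015b for that); the symbols P_sat, P_g, sigma_sat, and sigma_g are assumed here purely for illustration and denote the multisatellite and gauge estimates and their respective error standard deviations.

```latex
P_{\mathrm{comb}} =
\frac{\sigma_{\mathrm{sat}}^{-2}\, P_{\mathrm{sat}} + \sigma_{g}^{-2}\, P_{g}}
     {\sigma_{\mathrm{sat}}^{-2} + \sigma_{g}^{-2}}
```

The estimate with the smaller error variance receives the larger weight; with equal error variances the expression reduces to the arithmetic mean of the two estimates.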

a. Rainfall estimation: A problem of spatiotemporal scales
The selection of the spatiotemporal scale when evaluating precipitation products is of high importance given the nonlinear structure of precipitation in space and time. The spatiotemporal resolution of rain gauges (point-scale, short-term accumulation) profoundly differs from that of SPPs (gridcell average, temporal average of instantaneous measurements), which in turn vary internally (e.g., TMPA: 0.25°/3 h, IMERG: 0.1°/30 min), a problem that has received extensive attention (e.g., Wang and Wolff 2010; Villarini et al. 2008; Ciach and Krajewski 1999). Often all gauges within a satellite pixel are averaged and the gauge average is compared to the SPP pixel. With increasing spatial and temporal aggregation these differences become less relevant; however, spatiotemporal aggregation implies the averaging of zero rainfall areas or periods with variable positive rainfall intensities. While rainfall occurrence is often represented by a binary (Bernoulli) distribution, positively skewed rainfall intensities are frequently modeled as a Gamma distribution, despite being subject to ''heavy tails'' (Tarnavsky et al. 2012; Wilson and Toumi 2005). Averaging zeros and nonzeros combines these into a single statistical distribution with much less variation than the intensity alone. Given these statistical properties, the performance of satellite rainfall products can be expected to improve with increasing spatiotemporal aggregation without any improvements in estimation skill. By corollary, it is important to understand how precipitation estimation deteriorates with increasing resolution. However, this general behavior is further complicated in regions where low and irregular gauge network density is combined with high precipitation variability at the subgrid scale, such as tropical mountain regions. For instance, in the current study, the number of gauges varies substantially across the region (Fig. 3) with only 9.6% and 24% of all satellite pixels containing more than one gauge at 0.1° and 0.25°, respectively. Hence, in order to comprehensively evaluate TMPA and IMERG against rain gauge observations despite the differences in spatiotemporal scale (resolution), three separate spatiotemporal scales were selected for the assessment and higher-resolution products were aggregated to the respective scales:
1) 0.1°/1 h: This is the highest available spatiotemporal resolution given the IMERG spatial resolution (0.1°) and the gauge time step (1 h). IMERG and spatial gauge averages were compared at this scale.
2) 0.25°/3 h: This is the TMPA V7 native resolution. Gauges and IMERG were aggregated and all three products were compared at this resolution.
3) 0.25°/1 day: This is a common resolution used for hydrological simulation based on SPPs. All three products were space-time averaged to this scale.
At all scales rainfall products were evaluated as rainfall rates (mm h⁻¹). Furthermore, especially at finer spatial scales, there are insufficient gauges available to overcome the statistical impacts of the point-area difference. Therefore, this study further investigates the impact of scale in relation to the number of rain gauges available (section 3c).
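The temporal part of this aggregation can be sketched as follows. This is a minimal illustration rather than the processing chain used in the study: the function name `aggregate_to_rate`, the use of `None` for missing hours, and the rule that any missing hour invalidates its window are all assumptions made for the example.

```python
def aggregate_to_rate(hourly_mm, window_h):
    """Mean rainfall rate (mm/h) over consecutive closed windows of window_h hours.

    hourly_mm holds hourly accumulations in mm; None marks a missing hour.
    A window containing any missing hour yields None (assumed rule), and a
    trailing partial window is dropped.
    """
    if window_h < 1:
        raise ValueError("window_h must be a positive integer")
    rates = []
    for start in range(0, len(hourly_mm) - window_h + 1, window_h):
        block = hourly_mm[start:start + window_h]
        if any(v is None for v in block):
            rates.append(None)
        else:
            rates.append(sum(block) / window_h)
    return rates


# One wet hour (3 mm) followed by dry hours, aggregated to 3-h windows.
hourly = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0]
three_hourly = aggregate_to_rate(hourly, 3)
```

In this toy series a single 3 mm hour becomes a 1 mm/h rate spread over a 3-h window, which illustrates how aggregation mixes zero and nonzero intensities into one smoother value.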

b. Evaluation metrics
In the main ground validation, all gauges within the respective satellite pixel are averaged to evaluate the IMERG and TMPA pixels with ground observations of the same spatial support. The evaluation is split into four categories: 1) rainfall detection (empirical rainfall occurrence frequency and detection indicator scores), 2) quantitative errors in the estimation of rainfall intensities, 3) comparison of the cumulative probability distributions of the rainfall intensities, and 4) comparison of spatially averaged subregion time series across the assessment period.
In order to perform ground validation of satellite precipitation data, it is necessary to assume a minimum threshold in the intensities recorded by the rain gauges (Wang et al. 2008). Because of the high frequency of low rainfall intensities in some parts of the tropical Andes (e.g., Padrón et al. 2015), a threshold corresponding to one single record of the highest-resolution gauges available in the network (0.1 mm h⁻¹) has been found to be suitable to assess the entire range of rainfall intensities without eliminating the lowest intensities or introducing assumptions as to their distribution. The empirical rainfall occurrence frequency (ROF) is expressed as

$$\mathrm{ROF}_r = \frac{1}{N_{\mathrm{tot}}} \sum_{i=1}^{N_{\mathrm{tot}}} \mathbf{1}\left(p_{i,r} \ge 0.1\ \mathrm{mm\,h^{-1}}\right) \tag{1}$$

where i = 1, ..., N_tot indexes the time steps (total of N_tot time steps), the indicator function equals 1 when its condition holds and 0 otherwise, p is the precipitation intensity at a particular grid cell, and r is the aggregated spatial resolution (0.1° or 0.25°). Rainfall detection is assessed by three categorical error scores: the accuracy index (ACC; Duan et al. 2015), the probability of detection (POD), and the false alarm ratio (FAR):

$$\mathrm{ACC} = \frac{H + C}{N_{\mathrm{sync}}} \tag{2}$$

$$\mathrm{POD} = \frac{H}{H + M} \tag{3}$$

$$\mathrm{FAR} = \frac{F}{H + F} \tag{4}$$

where H is the number of rainfall time steps correctly detected (hits), C is the number of time steps correctly identified as nonraining (correct zeros), M is the number of raining time steps missed by the SPP (misses), F is the number of time steps falsely identified as raining (false alarms), and N_sync is the total number of synchronous measurements. ACC represents the fraction of time steps correctly classified (score ranges 0-1, perfect score of 1), POD represents the fraction of rain occurrences correctly detected (score ranges 0-1, perfect score of 1), and FAR represents the fraction of detected rainfall occurrences that were false alarms (score ranges 0-1, perfect score of 0).
To evaluate the impact on the rainfall detection scores of the limited satellite rainfall detection ability, as well as of the averaging of zero and nonzero rainfall intensities across the spatial support of the SPPs, detection statistics were computed for precipitation thresholds from 0 to 10 mm h⁻¹. Rainfall rates of both the SPP and the gauge-average ground reference that fall below the threshold are treated as zero rain. The threshold is then iteratively raised for both the SPP and the ground reference rainfall and the detection scores are recomputed accordingly.
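The detection scores and the threshold sweep described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the function names and the toy series are hypothetical, and a value is taken as "raining" when it is positive after values below the threshold have been set to zero.

```python
def is_rain(value, threshold):
    """Treat rates below the threshold as zero rain; rain means a positive rate."""
    effective = value if value >= threshold else 0.0
    return effective > 0.0


def detection_scores(sat, gauge, threshold=0.1):
    """Contingency counts and ACC, POD, FAR for synchronous SPP/gauge series."""
    hits = misses = false_alarms = correct_zeros = 0
    for s, g in zip(sat, gauge):
        s_rain, g_rain = is_rain(s, threshold), is_rain(g, threshold)
        if s_rain and g_rain:
            hits += 1
        elif g_rain:
            misses += 1
        elif s_rain:
            false_alarms += 1
        else:
            correct_zeros += 1
    n_sync = hits + misses + false_alarms + correct_zeros
    nan = float("nan")
    return {
        "ACC": (hits + correct_zeros) / n_sync if n_sync else nan,
        "POD": hits / (hits + misses) if hits + misses else nan,
        "FAR": false_alarms / (hits + false_alarms) if hits + false_alarms else nan,
    }


def rainfall_occurrence_frequency(series, threshold=0.1):
    """Fraction of time steps classified as raining."""
    return sum(is_rain(v, threshold) for v in series) / len(series)


# Hypothetical synchronous rates in mm/h.
sat = [0.0, 0.5, 2.0, 0.0, 1.2, 0.0]
gauge = [0.0, 0.3, 0.0, 0.4, 3.0, 0.0]
sweep = {t: detection_scores(sat, gauge, t) for t in (0.1, 1.0)}
```

In the toy series, raising the threshold from 0.1 to 1.0 mm/h turns one hit into a correct zero while the false alarm remains, so FAR rises from 1/3 to 1/2 and ACC rises from 4/6 to 5/6. With so few events the single remaining gauge rain event happens to be detected, so POD does not show the decline reported for the real data.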
In order not to double count rainfall detection errors, quantitative rainfall rate estimation errors are only computed for time steps where rainfall is accurately detected, as previously proposed by Tang et al. (2015) and Tan et al. (2016). Time steps where rainfall is falsely detected or missed are omitted prior to computing the following statistical metrics: Pearson correlation coefficient (CC), root-mean-square error (RMSE), relative RMSE (rRMSE), and percentage bias (PBIAS):

$$\mathrm{CC} = \frac{\sum_{i=1}^{n} (S_i - \bar{S})(G_i - \bar{G})}{\sqrt{\sum_{i=1}^{n} (S_i - \bar{S})^2 \, \sum_{i=1}^{n} (G_i - \bar{G})^2}} \tag{5}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (S_i - G_i)^2} \tag{6}$$

$$\mathrm{rRMSE} = \frac{\mathrm{RMSE}}{\bar{G}} \tag{7}$$

$$\mathrm{PBIAS} = 100 \times \frac{\sum_{i=1}^{n} (S_i - G_i)}{\sum_{i=1}^{n} G_i} \tag{8}$$

where S_i and G_i are, respectively, the satellite and gauge rainfall intensity at time step i; \bar{S} and \bar{G} are their corresponding arithmetic means over the assessment period; and n is the total number of time steps. While RMSE expresses random error in absolute terms (i.e., mm h⁻¹) and, therefore, will likely result in elevated errors in wet regions, rRMSE expresses random error relative to the mean precipitation rate.
Detection and quantitative estimation errors evaluate whether SPPs and gauges agree at the same time step. While SPP estimation accuracy for individual time steps might be low, SPPs may still be able to characterize the rainfall intensity distribution over the assessment period. For this purpose, the empirical cumulative distribution function [CDF; F(x)] is determined for each SPP and compared to the respective gauge CDF. The CDF can be represented as discrete percentiles, and the ratio of the percentiles across the entire CDF shows how well the SPPs estimate the gauge rainfall distribution:

$$R_j = \frac{p_{S,j}}{p_{G,j}} \tag{9}$$

where p is the rainfall intensity corresponding to the percentile j, for the SPP (subscript S) and the gauges (subscript G).
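The hit-only error statistics and the percentile ratio can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, rRMSE is normalized by the gauge mean as an assumption, and percentiles use a simple nearest-rank rule, which may differ from the estimator used in the study.

```python
import math


def intensity_errors(sat, gauge, threshold=0.1):
    """CC, RMSE, rRMSE and PBIAS over time steps where both series detect rain."""
    pairs = [(s, g) for s, g in zip(sat, gauge) if s >= threshold and g >= threshold]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two hit time steps")
    s_mean = sum(s for s, _ in pairs) / n
    g_mean = sum(g for _, g in pairs) / n
    cov = sum((s - s_mean) * (g - g_mean) for s, g in pairs)
    var_s = sum((s - s_mean) ** 2 for s, _ in pairs)
    var_g = sum((g - g_mean) ** 2 for _, g in pairs)
    denom = math.sqrt(var_s * var_g)
    rmse = math.sqrt(sum((s - g) ** 2 for s, g in pairs) / n)
    return {
        "CC": cov / denom if denom > 0 else float("nan"),
        "RMSE": rmse,
        "rRMSE": rmse / g_mean,  # assumed normalization by the gauge mean
        "PBIAS": 100.0 * sum(s - g for s, g in pairs) / sum(g for _, g in pairs),
    }


def percentile(values, j):
    """Nearest-rank percentile (assumed estimator) for 0 < j <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(j / 100.0 * len(ordered)))
    return ordered[rank - 1]


def cdf_ratio(sat, gauge, percentiles):
    """Ratio of SPP to gauge intensity at each requested percentile."""
    ratios = []
    for j in percentiles:
        g_p = percentile(gauge, j)
        ratios.append(percentile(sat, j) / g_p if g_p > 0 else float("nan"))
    return ratios


# Hypothetical rates in mm/h; the last two time steps are a miss and a false alarm.
errors = intensity_errors([1.0, 2.0, 3.0, 0.0, 5.0], [2.0, 4.0, 6.0, 1.0, 0.0])
ratios = cdf_ratio([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0], [50, 100])
```

In the toy data the satellite series is exactly half the gauge series on the three hit time steps, so the correlation is perfect while PBIAS is -50%, which shows how the metrics separate agreement in timing from systematic error.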

c. Impact of spatial scale
A fundamental difference between IMERG and TMPA is the improved native resolution (0.1°/30 min compared to 0.25°/3 h), resulting from an increased number of available satellites, especially PMW sensors, in the GPM constellation and the approach used for combining them. While the higher resolution results in a finer definition of precipitation fields and intensity gradients, a high density of rain gauges is required to provide areal mean ground observations to evaluate individual grid cells. As shown in Fig. 3, the gauge network in Ecuador and Peru does not provide high-density coverage: over 90% of gauged IMERG pixels only contain a single reference gauge. This implies that the majority of IMERG pixels that are evaluated in this study may be subject to substantial point-area difference effects (Ciach and Krajewski 1999). The impact of the scale difference between SPPs and corresponding ground-based estimates was investigated using three analyses: 1) changes in systematic error (PBIAS) using a spatial bootstrap subsampling approach, 2) evaluation of the subgrid-scale variability of a single satellite pixel, and 3) the spatial correlation structure of the different rainfall products across the study domain.
In the bootstrap analysis, pixels containing more than one gauge are subsampled by removing a single gauge at a time and spatially averaging the remaining n − 1 gauges across the pixel. PBIAS is then computed for the pixel using Eq. (8) based on the gauge average and the SPP estimates. The removed gauge is then replaced and the process is repeated n times, resulting in a PBIAS score associated with each removed gauge. The mean of the n PBIAS scores is then computed to obtain a single subsampled PBIAS score for each pixel. Comparing the PBIAS scores of the bootstrap approach against those of the gauge-averaging approach allows for interpreting the impact of removing individual gauges on the pixelwide gauge average. This offers insight into the dependence of systematic errors in the S-G evaluation on the gauge density.
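The subsampling procedure for a single pixel can be sketched as follows. This is a simplified illustration, not the study's implementation: the function names and toy values are hypothetical, and PBIAS is computed over all time steps rather than over hit time steps only.

```python
def pbias(sat, gauge_avg):
    """Percentage bias of the SPP series relative to the gauge-average series."""
    return 100.0 * sum(s - g for s, g in zip(sat, gauge_avg)) / sum(gauge_avg)


def leave_one_out_pbias(sat, gauges):
    """PBIAS after removing each gauge in turn, plus the mean of those scores.

    gauges is a list of per-gauge time series located in the same pixel.
    """
    n = len(gauges)
    if n < 2:
        raise ValueError("pixel must contain more than one gauge")
    scores = []
    for k in range(n):
        kept = gauges[:k] + gauges[k + 1:]  # drop gauge k, keep the other n - 1
        avg = [sum(step) / len(kept) for step in zip(*kept)]
        scores.append(pbias(sat, avg))
    return scores, sum(scores) / n


# Hypothetical pixel with three gauges and two time steps (mm/h).
sat = [2.0, 2.0]
gauges = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
scores, mean_score = leave_one_out_pbias(sat, gauges)
full_avg = [sum(step) / len(gauges) for step in zip(*gauges)]
reference = pbias(sat, full_avg)
```

In this example the full gauge average matches the satellite series (PBIAS of 0), whereas dropping the driest or wettest gauge shifts PBIAS to about +33% or -20%, which illustrates how sensitive the apparent systematic error can be to gauge density.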
The impact of spatial scale was further analyzed by focusing specifically on the individual 0.25° pixel with the highest number of gauges, hereafter referred to as pixel 1167 (see Fig. 2 for location). This pixel covers the mid-Andean water divide and includes both the west and east Andean slopes. For pixel 1167, scatterplots of synchronous measurements of 1) individual gauges within that pixel, 2) the spatial gauge average across the pixel, and 3) the SPP estimate (IMERG or TMPA) are presented for the different space-time scales to demonstrate how the internal variability at the subgrid scale affects the match of SPP and gauge observations.
Last, the ability of SPPs to capture the geographical structure of precipitation fields is evaluated by comparing the spatial correlation structure between satellite and pixel-average gauge rainfall. The CC is calculated for all pixel pairs across the entire assessment period. The CC results are categorized by the distance between the pixel pairs, averaged across bins of 27 km (which approximately corresponds to 0.25° across the study region), and presented as a spatial correlogram.
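A spatial correlogram of this kind can be sketched as follows. This is a minimal sketch under simplifying assumptions: pixel centers are given as planar coordinates in km rather than latitude and longitude, and the function names and toy data are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equally long series (nan if either is constant)."""
    n = len(a)
    a_mean, b_mean = sum(a) / n, sum(b) / n
    cov = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    var_a = sum((x - a_mean) ** 2 for x in a)
    var_b = sum((y - b_mean) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else float("nan")


def correlogram(coords_km, series, bin_km=27.0):
    """Mean pairwise CC per distance bin; keys are bin indices (0 = 0 to bin_km)."""
    sums, counts = {}, {}
    for i, j in combinations(range(len(coords_km)), 2):
        r = pearson(series[i], series[j])
        if math.isnan(r):
            continue  # skip pairs with a constant series
        b = int(math.dist(coords_km[i], coords_km[j]) // bin_km)
        sums[b] = sums.get(b, 0.0) + r
        counts[b] = counts.get(b, 0) + 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Three hypothetical pixels on a line, 20 km and 60 km from the first one.
coords = [(0.0, 0.0), (20.0, 0.0), (60.0, 0.0)]
rain = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
result = correlogram(coords, rain)
```

Running the same function on the gauge-average field and on each SPP field gives curves that can be compared bin by bin to see whether a product decorrelates with distance faster or slower than the gauges.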

a. Mean rainfall fields
The spatial mean precipitation across the assessment period (Fig. 4) shows that IMERG contains an improved definition of the orographically enhanced precipitation hot spots along the eastern Andean flanks and Amazonian sub-Andes in Peru between 7° and 14°S compared to TMPA. Furthermore, because of the improved spatial resolution of IMERG (0.1°), the steep precipitation gradient from the drier Andean highlands to the orographically enhanced sub-Andes is better defined, while it was previously ''smoothed out'' over a larger spatial distance by TMPA. Isolated intra-Andean valleys, for instance, those between 9° and 12°S with elevated precipitation levels, are represented with a far smaller spatial footprint by IMERG than TMPA. In terms of the mean satellite-gauge (S-G) difference at the aggregated 0.25°/daily scale (Fig. 4), IMERG and TMPA show a very similar pattern with good S-G agreement (within 2.5 mm day⁻¹) at most locations. Both SPPs show a cluster of pixels where the gauge mean is underestimated in the Piura region (northern Peru/southern Ecuador), while for IMERG a smaller cluster of pixels overestimates gauges in the Altiplano region (southeastern Peru). As a result, across the study region TMPA correlates better with mean pixel gauge averages (r_TMPA = 0.80) than IMERG (r_IMERG = 0.76).

b. Subregional temporal patterns
As a second step, the spatial mean precipitation time series of gauge observations, IMERG, and TMPA (at 10-day accumulations) are compared for each subregion (Fig. 5). The results show substantial regional variations. Subregions PCN, AW, AE, and AL show good agreement between gauge observations and IMERG and TMPA. Overestimation of individual rainfall peaks by TMPA, especially in the AE slopes, is notably reduced by IMERG. Underestimation of individual rainfall peaks by both SPPs is observed particularly during the austral winter season (June-August). In the dry PCS region, both IMERG and TMPA systematically overestimate the mean as well as rainfall peaks proportionally to the rainfall magnitude, although overestimation of peaks is stronger by IMERG than TMPA. Subregion ASA differs in that the mean is underestimated substantially by both SPPs. Here rainfall peaks are generally underestimated, although IMERG returns higher estimates than TMPA with occasional overestimation by IMERG. However, this observation should be considered in the context of the low and irregular (and thereby potentially unrepresentative) gauge coverage in ASA.

c. Rainfall detection and occurrence frequency
As shown in Fig. 6, at its native resolution IMERG underestimates both the median and the variability of the ROF observed by the gauges in all climatic subregions except for PCS and AL. With increasing spatiotemporal aggregation, IMERG estimates are elevated, thus leading to better estimation of subregions previously underestimated (PCN, AW, and AE), but extensive overestimation in PCS and AL. In contrast to IMERG, TMPA shows systematically lower ROF, resulting in substantial underestimation in most regions, but better estimation of gauge ROF in PCS and AL. At the daily time step, IMERG and TMPA return comparable ROF results, although gauge ROFs are still overestimated for PCS, while the median and variability are underestimated in ASA by both products. Detection scores (Fig. 7) show superior performance of IMERG compared to TMPA in terms of higher POD and ACC and lower FAR for all scales and subregions. At 0.25°/3 h IMERG performs substantially better than TMPA, while performance improves with increasing levels of spatiotemporal aggregation for both SPPs thereafter. In general, the detection ability of both SPPs weakens when the rainfall detection threshold is increased, resulting in decreasing POD and increasing FAR. This behavior can be explained in part by the reduction in sample size with increasing rainfall detection threshold: as all rainfall intensities below the threshold are set to zero, the total number of time steps with nonzero rainfall intensity is reduced. Hence, the number of rainfall events at higher thresholds decreases and failure to detect these is given proportionally higher weighting as per Eq. (3), resulting in a decreasing POD. The opposite applies to FAR: with increasing rainfall detection threshold, cases where both SPP and ground reference recorded a low rainfall intensity will no longer be treated as hits, but as correct zeros, such that with a constant rate of false alarms, the FAR will increase because of the reduction in hits.
Similarly, with increasing rainfall thresholds, the number of zero rain events increases, leading to elevated ACC scores, as correct zeros become proportionally more frequent. However, this analysis shows a consistent pattern in that for all three scores and across most regions, IMERG returns systematically higher POD and ACC, but lower FAR scores for all thresholds compared to TMPA at the respective spatiotemporal scale. This suggests an improved rainfall detection ability by IMERG compared to TMPA.
In terms of regional differences, both products show weakest performance (low POD, high FAR) in PCS (Fig. 7). This can be attributed to the infrequent rainfall in this arid region, combined with, at times, elevated levels of humidity that do not result in precipitation. ACC shows very high values in this region, as a result of being dominated by correct zeros, which does not necessarily represent an improvement in terms of rainfall detection.

FIG. 5. Comparison of the regional mean precipitation time series at 10-day accumulations over the assessment period for gauges (black), IMERG (red), and TMPA (blue).

d. Rainfall intensity errors
As shown in Fig. 8, the Pearson linear correlation coefficient increases with increasing spatiotemporal scale. Subregions experiencing high precipitation levels generally exhibit higher median correlation results with little spread, whereas in the dry PCS region, there is a wide spread of results. IMERG systematically outperforms TMPA, with improved satellite-gauge correlation being most pronounced in regions subject to high precipitation rates (PCN, ASA, AL) and less so in the arid subregion (PCS). RMSE scores show consistent reduction in random error by IMERG compared to TMPA across all subregions; however, RMSE scores of both SPPs are highest for those subregions experiencing highest levels of precipitation (i.e., ASA and AL). In agreement with the other detection and quantitative error statistics, rRMSE is instead most elevated in dry regions, especially PCS, where it is higher for IMERG than for TMPA. In terms of systematic error, TMPA overestimates gauge rainfall observations (large positive PBIAS) in the Andean regions (AW and AE) and the PCN. In all three regions, PBIAS for IMERG is substantially reduced, suggesting large contributions of systematic error have been eliminated. Bias is highest in arid conditions with infrequent rainfall (PCS) for both SPPs, while it is generally lowest in the wet Amazonian regions (ASA and AL), where both return comparable results. In contrast to all other subregions, in PCS IMERG did not improve on TMPA, with PBIAS for IMERG exceeding that of TMPA at 0.25°/daily, thus implying that in dry conditions IMERG estimation accuracy has not improved over TMPA at corresponding space-time scales.
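The quantitative error statistics discussed above can be sketched as follows. The definitions used here are assumptions based on common practice, not necessarily the exact equations of this study: rRMSE is taken as RMSE divided by the mean gauge rainfall, and PBIAS as 100 times the summed satellite-gauge difference divided by the summed gauge rainfall. The function name `error_stats` is hypothetical.

```python
import math


def error_stats(sat, gauge):
    """Return Pearson r, RMSE, rRMSE and PBIAS (%) for paired series.

    Assumed definitions: rRMSE = RMSE / mean(gauge);
    PBIAS = 100 * sum(sat - gauge) / sum(gauge).
    """
    n = len(gauge)
    nan = float("nan")
    mean_s = sum(sat) / n
    mean_g = sum(gauge) / n
    cov = sum((s - mean_s) * (g - mean_g) for s, g in zip(sat, gauge))
    var_s = sum((s - mean_s) ** 2 for s in sat)
    var_g = sum((g - mean_g) ** 2 for g in gauge)
    # Correlation is undefined when either series is constant.
    corr = cov / math.sqrt(var_s * var_g) if var_s > 0 and var_g > 0 else nan
    rmse = math.sqrt(sum((s - g) ** 2 for s, g in zip(sat, gauge)) / n)
    # Relative measures are undefined when the gauges record no rain.
    rrmse = rmse / mean_g if mean_g > 0 else nan
    total_g = sum(gauge)
    pbias = 100.0 * sum(s - g for s, g in zip(sat, gauge)) / total_g if total_g > 0 else nan
    return {"corr": corr, "rmse": rmse, "rrmse": rrmse, "pbias": pbias}


# Hypothetical values: the satellite doubles every gauge amount.
print(error_stats([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0]))
```

The toy case shows why correlation alone is insufficient: a product that doubles every gauge value has a correlation of 1 but a PBIAS of 100%. Dividing by the mean gauge rainfall also suggests why relative errors tend to inflate in arid subregions, where that mean is small.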

e. Statistical probability distributions of rainfall intensities
With respect to the cumulative probability distribution of estimated precipitation intensities (Fig. 9a), both SPPs overestimate corresponding gauge quantiles across the entire intensity distribution and at all scales, except for IMERG at 0.25°/3 h between the 40th and 90th percentiles. However, for identical spatiotemporal scales, IMERG returns a CDF ratio closer to 1.0 than TMPA. At their native resolution, both SPPs overestimate gauge quantiles substantially, with a maximum factor of 2.8 (IMERG) and 2.4 (TMPA) at approximately the 30th percentile and decreasing thereafter. Unlike for TMPA, the CDF ratio for IMERG increases above the 90th percentile. As shown in the subregional plots, this increase was found to be isolated to the PCS subregion. This finding agrees with the positive bias for IMERG observed in PBIAS (Fig. 8) and the time series analysis (Fig. 5), which showed strong overestimation of peak rainfall by IMERG in PCS. At the daily aggregation, both SPPs show gradually increasing, positive CDF ratios with a maximum of approximately 1.7 at the 99th percentile. However, for the majority of the cumulative probability distribution and especially below the 40th percentile, the CDF ratio for IMERG is far lower than that of TMPA, suggesting superior ability of IMERG in estimating the gauge rainfall probability distribution. Overestimation of the gauge quantiles by TMPA is most evident at the TMPA native resolution (0.25°/3 h) and most pronounced in ASA, AL, and PCN, that is, under conditions of high frequency and high total rainfall. At this resolution IMERG underestimates gauge rainfall in almost all subregions, especially between the 50th and 80th percentiles, which may be an artifact of spatiotemporal aggregation, as IMERG estimation accuracy is improved both at finer and coarser spatiotemporal scales. However, irrespective of scale, medium to high quantiles are underestimated by IMERG for the high AL and ASA regions.
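A minimal sketch of a quantile-based CDF ratio is given below. It assumes the ratio is the satellite quantile divided by the gauge quantile at the same percentile, so that values above 1.0 indicate overestimation; the exact construction used for Fig. 9 may differ. The helper names `percentile` and `cdf_ratio` and the sample values are hypothetical.

```python
import math


def percentile(values, p):
    """Empirical percentile (p in 0..100) using linear interpolation."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def cdf_ratio(sat, gauge, percentiles):
    """Ratio of satellite to gauge quantiles at each requested percentile.

    Assumption: a ratio above 1.0 means the satellite quantile exceeds
    the gauge quantile. Returns NaN where the gauge quantile is zero.
    """
    ratios = {}
    for p in percentiles:
        q_gauge = percentile(gauge, p)
        q_sat = percentile(sat, p)
        ratios[p] = q_sat / q_gauge if q_gauge > 0 else float("nan")
    return ratios


# Hypothetical nonzero intensities (mm/h), not data from the study.
gauge = [1.0, 2.0, 3.0, 4.0, 5.0]
sat = [2.0, 4.0, 6.0, 8.0, 10.0]
print(cdf_ratio(sat, gauge, [25, 50, 99]))
```

Because the ratio compares quantiles of two samples that may differ in size, as noted in the Discussion, it describes the shape of the distributions rather than event-by-event agreement.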

f. Impact of spatial scale
When comparing the bootstrapping to the standard pixel-gauge averaging, reductions in median percentage bias for IMERG at its native resolution (0.1°/1 h) over the Andean and Amazonian subregions (Fig. 10) increases for both SPPs at all space-time scales. The variations show the average response in the sensitivity of the bias calculation to a removal of only a single gauge at each pixel, highlighting the importance of the coverage and representativeness of the gauge network for SPP evaluation. However, the majority of pixels only contain a single gauge in both the pixel averaging and the bootstrapping approaches. Hence, the difference between the two methods is very low when averaged across all pixels, but will be larger locally, that is, for individual pixels containing more than a single gauge. Focusing on the 0.25° pixel with the highest gauge density, Fig. 11 shows a large variation between the grid SPP estimates and individual gauge estimates within the pixel. While both IMERG and TMPA overestimate the spatial mean gauge rainfall by approximately a factor of 2 (IMERG 0.1°/1 h: 2.56, IMERG 0.25°/3 h: 2.17, TMPA: 2.78), the relationship between an individual gauge and the SPP estimates may even be negative, that is, gauge P46 (from -0.12 to 0.06). This gauge is located on an east-facing slope in a north-south-oriented, medium-elevation (2960 m MSL) valley discharging to the Amazon, whereas most gauges (labeled "JTU") are clustered in a high-altitude region (above 4000 m MSL) to the west of the Antisana volcano (southwest of the pixel). For these gauges, the Antisana volcano acts as a barrier, restricting humidity transported by easterly trade winds from the Amazon basin. On the other hand, gauges located at lower elevations farther downstream are directly exposed to the higher humidity levels.
Finally, evaluation of the spatial correlation structure across the entire region (Fig. 12) suggests that, irrespective of the spatiotemporal scale, IMERG and TMPA estimate higher spatial correlation distances than the gauges. This implies the degree of spatial averaging (smoothing) is far higher in the SPPs than in the gauge estimates. However, this behavior may also be attributed to the majority of ground-based pixels only containing a single gauge. Thus, these pixels represent point-scale rainfall, which is by nature more variable and has shorter spatial correlation distances than grid average rainfall as reported by the SPPs. Analysis of the spatial correlation of pixels containing at least two gauges (Figs. 12d-f) suggests similar spatial correlation results as the SPPs, thus supporting the notion that observed differences in spatial correlation can be predominantly attributed to the point-area difference in spatial support of the gauges versus the SPPs. Considering only the spatial correlation results for pixels with at least two gauges, IMERG and TMPA still overestimate spatial correlation, with the differences between the SPPs smaller than satellite-gauge differences in spatial correlation across all scales.
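One simple way to examine a spatial correlation structure is to pair every two locations and relate the Pearson correlation of their time series to their separation distance. The sketch below illustrates that idea only; it is an assumption that a comparable pairwise approach underlies Fig. 12, and the function names, the great-circle distance with a 6371 km Earth radius, and the station values are all hypothetical.

```python
import math
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, assuming a spherical Earth (r = 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def pearson(x, y):
    """Pearson correlation of two equal-length series (NaN if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def correlation_vs_distance(stations):
    """Return (distance_km, correlation) for every station pair, nearest first.

    `stations` maps a name to (lat, lon, rainfall_series).
    """
    pairs = []
    for a, b in combinations(stations, 2):
        lat_a, lon_a, series_a = stations[a]
        lat_b, lon_b, series_b = stations[b]
        dist = haversine_km(lat_a, lon_a, lat_b, lon_b)
        pairs.append((dist, pearson(series_a, series_b)))
    return sorted(pairs, key=lambda pair: pair[0])


# Hypothetical stations on the equator, not gauges from the study.
stations = {
    "A": (0.0, 0.0, [0.0, 1.0, 2.0, 3.0]),
    "B": (0.0, 1.0, [0.0, 2.0, 4.0, 6.0]),
    "C": (0.0, 3.0, [3.0, 2.0, 1.0, 0.0]),
}
for dist, corr in correlation_vs_distance(stations):
    print(f"{dist:7.1f} km  r={corr:+.2f}")
```

Applying the same function once to gauge series and once to the collocated SPP pixel series would allow the two correlation-distance relationships to be compared on equal terms.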

Discussion
Implications derived from the observations of IMERG performance in this study are summarized and compared to previous studies in Table 2 and discussed in this section. In terms of the comparison of IMERG against TMPA, no improvements are evident with respect to mean rainfall across the study period, and TMPA even shows higher S-G correlation. However, separating rainfall detection and rainfall rate estimation accuracy reveals that IMERG has superior skill in estimating both of these precipitation components. The good agreement between TMPA and gauges in terms of the spatial mean rainfall field can be attributed to ROF being underestimated while rainfall intensities were overestimated. Furthermore, irrespective of the regionally differing precipitation regimes, a consistent observation is the reduction in false peaks by IMERG compared to TMPA, suggesting a better estimation accuracy of high rainfall intensities by IMERG. Overall, improvements are most pronounced in the high Andes, which experience a large fraction of low-intensity rainfall (Padrón et al. 2015), suggesting an improved light rainfall detection ability.

FIG. 11. Analysis of the impact of scale on a single satellite pixel (ID 1167): (a) the topography and gauge network as well as superimposed grids at 0.1° and 0.25° resolution and scatterplots of the individual gauges (and their areal grid mean) against the corresponding SPP estimate for (b) TMPA at 0.25°, (c) IMERG at 0.1°, and (d) IMERG at 0.25° with best-fit lines using linear regression. The gauge M0188 is not included in the scatterplots as no "hits" were observed.
Despite improvements in estimating the majority of the rainfall intensity distribution compared to TMPA, IMERG markedly overestimates the high intensities, that is, the highest quantiles of the cumulative probability distribution. This observation has already been reported elsewhere (Sahlu et al. 2016) and is of high relevance, in particular, for applications focused on rainfall or hydrological extremes such as intensity-duration-frequency curves or hydrological flood simulation.
Additionally, IMERG continues to overestimate both the frequency of rainfall as well as rainfall intensities in extremely dry regions, as in the case of PCS in this study, confirming previous findings by Tang et al. (2016), Sharifi et al. (2016), and Guo et al. (2016). In this region, improvements of IMERG over TMPA are lowest by all considered statistics. In particular, the rainfall detection analysis showed high frequency of missed rainfall and falsely detected rainfall. Such false positives can potentially stem from incorrectly transformed infrared retrievals when high cloud cover is falsely translated to rainfall. Similarly, high rates of subcloud evaporation may account for the discrepancy between gauge recorded rainfall and SPP estimates. Quantitative errors were generally highest in the dry PCS region for both TMPA and IMERG, which agrees with previous findings in arid regions of China by Tang et al. (2016).
For all spatiotemporal scales IMERG shows a lower CDF ratio than TMPA, suggesting better estimation of the gauge probability distributions than its predecessor. However, the IMERG CDF ratio increases at high quantiles for all scales, implying that IMERG overestimates the frequency of heavy rainfall, especially in arid areas (i.e., subregion PCS). It should be noted that direct comparison of quantiles is limited, as the sample size of rainfall events differs between regions depending on the rainfall controls and also between gauges and SPPs depending on the SPP estimation accuracy and the rain gauge density. For instance, the high fraction of missed rainfall and false alarms in PCS by IMERG implies sensor and retrieval algorithm limitations under these conditions. Overestimation of high percentiles therefore does not necessarily imply that the frequency of heavy rainfall events is overestimated, but could potentially be a result of proportional underestimation of the bulk of the rainfall intensities (i.e., low quantiles). Furthermore, despite improvements over TMPA in spatially defining high rainfall areas as well as estimating rainfall rates, IMERG fails to accurately capture the high precipitation rates in orographically enhanced regions (i.e., ASA), which is also the region with the lowest coverage of the network. Here, warm clouds that precipitate at temperatures higher than those expected based on ice particle distributions assumed in PMW-based precipitation estimates may result in underestimations of ground-observed rainfall by the SPPs (Dinku et al. 2010). This highlights the importance of further research into the estimation of tropical mountain precipitation by SPPs and the need to increase the number of ground-based stations in data-scarce regions.
This study has also highlighted the importance of the gauge network used for validation of SPPs. For instance, inadequate estimation of precipitation in ASA is likely to be a combined result of limitations in precipitation estimation ability as well as the low and irregular distribution of the local gauge network used for evaluation (see Fig. 1). While random errors may be addressed by statistical methods such as sampling or random simulations, elimination of systematic error (i.e., bias) requires reliable validation data for satellite-gauge adjustment (Tan et al. 2016). The unrepresentative coverage of rain gauge networks is a well-established and somewhat fixed boundary condition of hydrometeorological studies and a principal motivation for extensive research into SPPs. However, the impacts of insufficient gauges and the resulting point-area differences continue to affect satellite-gauge evaluations.
An example of this is illustrated by the analysis of the individual pixel 1167 (Fig. 11). Here overestimation of the spatial gauge mean by the SPP may stem from the gauges preferentially sampling the low rainfall area in the southwest, leading to an underestimation of the true spatial mean of the area within the pixel. Similarly, the negative relationship between gauge P46 and the SPPs illustrates the impact of the point-area difference: while subgrid-scale variability in precipitation patterns is not captured by the SPP, it may strongly impact individual gauges. Disagreements between SPPs and gauge estimates therefore do not necessarily imply satellite retrieval errors but can similarly stem from low density or nonuniform gauge coverage. While restricting satellite-gauge comparisons to pixels with at least three gauges would eliminate vast regions worldwide and hamper assessment of SPPs specifically in poorly gauged regions, the representativeness of gauge networks needs to be considered when S-G agreement is poor.

Conclusions
This study performed a comparative ground validation of IMERG and TMPA against a network of 302 rain gauges in Ecuador and Peru over a 17-month period from April 2014 to August 2015. The region is influenced by a range of climatic drivers, resulting in substantial differences in spatiotemporal precipitation patterns from the dry Peruvian coast across the tropical Andes to the Amazonian lowlands. To comprehensively evaluate the SPPs, the study area was divided into six hydrometeorological subregions and rainfall occurrence frequency, detection, quantitative estimation errors, and accuracy of empirical cumulative distribution probabilities were assessed. Despite similar precipitation means over the assessment period, IMERG outperformed TMPA in most validation statistics, demonstrating lower errors in detection and quantitative rainfall rate estimation as well as a higher accuracy in estimating occurrence frequency and rainfall intensity distributions. For both products, performance increased with increasing spatiotemporal scale because of the reduction of the space-time variability of rainfall patterns. Advances to sensor technology and retrieval algorithms of DPR and GMI have resulted in improved detection of low rainfall intensities (<0.7 mm h⁻¹) and higher accuracy in estimating medium and high rainfall intensities, with consequential improvements in terms of defining the statistical probability distribution of rainfall. Improvements of IMERG over TMPA are geographically most pronounced in the high Andes (AW and AE), which confirms the promising results from previous studies with respect to the potential of IMERG in high-altitude regions. Substantial errors in terms of overestimating the frequency of rainfall as well as positive bias and largest random errors persist along the dry Peruvian coastline (PCS), where IMERG does not show improvements over TMPA. Rainfall patterns characterized by infrequent rainfall events and very low mean annual precipitation totals in this subregion result in a high fraction of missed rainfall and falsely estimated rainfall, especially for high rainfall intensities. On the other hand, IMERG shows improvement in spatially defining and quantifying orographically enhanced precipitation hot spots in the Amazonian sub-Andes, although absolute gauge rainfall totals are underestimated.
The study has also highlighted the importance of utilizing a well-developed gauge network with spatiotemporally representative coverage for evaluating SPPs. While the premise of using SPPs lies in their ability to provide rainfall estimates in poorly gauged regions, assessing SPPs in these regions is complicated by the very fact that the gauge network does not allow for comparison with equal spatial support, potentially resulting in substantial point-area differences between gauge and satellite estimates. Further research should therefore focus on developing metrics to evaluate the representativeness of gauge networks as well as understanding limitations and associated uncertainty in their estimation accuracy of true rainfall, for example, employing geostatistics to quantify gauge interpolation and S-G merging uncertainties (e.g., Delrieu et al. 2014) as well as S-G error frameworks that account for the nonlinear structure of rainfall and how this manifests itself in rainfall detection and rate estimation errors (Tan et al. 2016; Maggioni et al. 2014).
Further evaluation of IMERG in the tropical Andes should investigate what empirical conditional probability distribution is associated with each rainfall detection category (hits, misses, false alarms) and how this varies seasonally and regionally. Investigations of single-sensor (level 2) products against ground observations may help identify the performance of individual sensor retrievals (DPR, GMI, IR retrievals), and thereby support the attribution of errors in the combined IMERG product (Tan et al. 2016). While the results presented here suggest IMERG has already achieved robust improvements in estimating precipitation in the tropical Andes, these should be seen in the context of the short assessment period (17 months) that overlapped with the onset of an El Niño event in mid-2015. Interannual rainfall variability in Ecuador and Peru is strongly affected by ENSO variations. Therefore, evaluation over longer time periods (e.g., TRMM era retrospective since 1998) will help ascertain the influence of interannual precipitation variability and the impacts of changes in the IMERG and TMPA calibration algorithms. For the Ecuador-Peru region, comparative analysis of IMERG versions 3 and 4 as well as distributed hydrological model simulations based on IMERG products are currently being developed.

datasets were provided by the NASA Goddard Space Flight Center's PMM and PPS, which develop and compute IMERG and TMPA as a contribution to GPM and TRMM, respectively.