ACIX-Aqua: A global assessment of atmospheric correction methods for Landsat-8 and Sentinel-2 over lakes, rivers, and coastal waters Remote Sensing of Environment

available for Landsat-8 and Sentinel-2 data processing. Over 1000 radiometric matchups from both freshwaters (rivers, lakes, reservoirs) and coastal waters were utilized to examine the quality of derived aquatic reflectances ( ̂ ρ w ). This dataset originated from two sources: Data gathered from the international scientific community (henceforth called Community Validation Database, CVD), which captured predominantly inland water observations, and the Ocean Color component of AERONET measurements (AERONET-OC), representing primarily coastal ocean environments. This volume of data permitted the evaluation of the AC processors individually (using all the matchups) and comparatively (across seven different Optical Water Types, OWTs) using common matchups. We found that the performance of the AC processors differed for CVD and AERONET-OC matchups, likely reflecting inherent variability in aquatic and atmospheric properties between the two datasets. For the former, the median errors in ̂ ρ w ( 560 ) and ̂ ρ w ( 664 ) were found to range from 20 to 30% for best-performing processors. Using the AERONET-OC matchups, our performance assessments showed that median errors within the 15 – 30% range in these spectral bands may be achieved. The largest uncertainties were associated with the blue bands (25 to 60%) for best-performing processors considering both CVD and AERONET-OC assessments. We further assessed uncertainty propagation to the downstream products such as near-surface concentration of chlorophyll- a (Chl a ) and Total Suspended Solids (TSS). Using satellite matchups from the CVD along with in situ Chl a and TSS, we found that 20 – 30% uncertainties in ̂ ρ w ( 490 ≤ λ ≤ 743 nm ) yielded 25 – 70% uncertainties in derived Chl a and TSS products for top- performing AC processors. We summarize our results using performance matrices guiding the satellite user community through the OWT-specific relative performance of AC processors. Our analysis stresses the need for better representation of aerosols, particularly absorbing ones, and improvements in corrections for sky- (or sun-) glint and adjacency effects, in order to achieve higher quality downstream products in


Introduction
Compensation for atmospheric scattering and absorption and for surface reflection at the air-water interface (i.e., sky-glint and sun-glint) from the signal measured at the Top of Atmosphere (TOA) is referred to as the process of atmospheric correction (AC). Robust AC is essential for the accurate retrieval of aquatic reflectance and downstream science products (e.g., near-surface concentration of chlorophyll-a; Chla, and Total Suspended Solids; TSS) from remotely sensed observations. AC over the open ocean is carried out adequately, as reported by the International Ocean Color Coordinating Group (IOCCG) (IOCCG, 2010), but over inland and coastal waters inaccurate AC still leads to large uncertainties in satellite data products, thus limiting the detection of subtle variability in aquatic ecosystems . As a result, some satellite-based methods for the detection of harmful algal blooms (HABs), for instance, rely on Level 1 TOA or simple Rayleighcorrected quantities in order to avoid large uncertainties in Level 2 products introduced by poor AC performance over eutrophic waters Matthews and Bernard, 2013;Schaeffer et al., 2018;Stumpf et al., 2016).
The availability of image data at resolutions on the order of tens of meters, such as those acquired by the joint National Aeronautics and Space Administration (NASA) and U.S. Geological Survey (USGS) Landsat program and the Copernicus Sentinel-2 Services, has spurred development of applications for smaller water bodies such as lakes, rivers and estuaries. This has led to the development of a number of novel AC processors to obtain accurate satellite-derived aquatic reflectances (ρ w ) for downstream products (e.g., Chla). These processors show significant differences in methods and previous ad hoc comparisons were limited in geographic scope, or in the number or types of matchups (i.e., coastal versus freshwater), and did not often adhere to an identical matchup analysis approach De Keukelaere et al., 2018;Ilori et al., 2019;Pahlevan et al., 2017b;Pahlevan et al., 2017c;Pereira-Sandoval et al., 2019;Renosh et al., 2020;Soomets et al., 2020;Vanhellemont, 2019;Warren et al., 2019). Thus, important questions remain unaddressed, for example, whether processors meet the currently defined 30% threshold requirements (Global Climate Observing System; GCOS) across all bands, how they compare to heritage approaches (e.g., Franz et al., 2015), and specifying which processor(s) can provide more reliable downstream products given the uncertainties in ρ w under various atmospheric conditions and/or across distinct aquatic ecosystems. A list of widely used notations and acronyms throughout this article is provided in Table 1.
To address important questions relating to atmospheric correction, NASA and the European Space Agency (ESA) jointly conducted an Atmospheric Correction Intercomparison eXercise (ACIX) in coordination with the Committee on Earth Observation Satellites (CEOS) validation group. The initial exercise (ACIX-I; Doxani et al. (2018)) focused on diverse land-cover types and continental/coastal atmospheric conditions and mostly evaluated aerosol optical thickness (AOT) retrieval to gauge performances. The scope of the first exercise was however limited for aquatic studies due to the absence of assessments of ρ w retrieval accuracy. Therefore, a second Atmospheric Correction Intercomparison eXercise (ACIX-II), hereafter referred to as ACIX-Aqua, commenced in October 2018. To fulfill the objective of ACIX-Aqua, a large number of high-quality observations from a representative range of aquatic environments were desirable. Automated and systematically processed radiometric observations through the Ocean Color component of the Aerosol Robotic Network (AERONET-OC; Zibordi et al., 2009a;Zibordi et al., 2006) are an invaluable asset for such analyses (Hlaing et al., 2013;Jamet et al., 2011;Mélin et al., 2010). They mostly represent coastal ecosystems, but also include a number of freshwater sites (Philipson et al., 2016). To complement the AERONET-OC database and to add more data representing freshwater ecosystems, a communitywide data sharing initiative was undertaken. The goal was to include optically diverse water bodies where, for example, AC may be more challenging due to environmental factors, such as the presence of absorbing aerosols or land adjacency effects (AE) (IOCCG, 2018). This second data set is henceforth referred to as the Community Validation Database (CVD).
The purpose of this article is to provide an objective assessment of state-of-the-art atmospheric correction using a global dataset representing a wide array of atmospheric and aquatic conditions. We evaluated eight different AC processors applied to Landsat-8 Operational Land Imager (OLI) data and Sentinel-2A/B MultiSpectal Instrument (MSI) images over inland and coastal waters. Owing to their inherent differences in measurement techniques and representations of aquatic environments, the AERONET-OC and CVD matchups were primarily treated independently. Additionally, our entire dataset (AERONET-OC and CVD aggregated) was divided into Optical Water Types (OWTs, Section 3.4), which allowed an assessment of processors across widely variable coastal and inland water conditions. Furthermore, we aimed to investigate how uncertainties in ρ w manifest in satellite retrieved Chla and TSS for each processor. To that end, we provide a brief overview of atmospheric correction followed by a description of the in situ data employed in the analysis. Subsequently, succinct descriptions of the AC processors, matchup selection, OWT classification, as well as Chla and TSS retrieval algorithms, and statistical metrics used for the performance assessments are presented. The results and their implications for scientific studies and monitoring applications are further elaborated upon in Sections 4 and 5. Finally, we offer recommendations (Section 6) on the viability of the AC processors for freshwater and coastal ecosystems with due attention to the quality of downstream products, such as Chla and TSS.

Background: Atmospheric correction
During the early ocean color era, prior to the launch of the Coastal Zone Color Scanner (CZCS), radiative transfer analyses had shown that most of the "blue" photons reaching TOA over the ocean arise from within the atmosphere (Gordon, 1976) through the scattering and absorption processes induced by gas molecules and aerosols. The purpose of AC for satellite observations is, therefore, to compensate for the photons that do not originate from the water column. The goal is to estimate non-dimensional aquatic reflectance ρ w ,which may also be denoted as water-leaving reflectance in the literature (Ruddick et al., 2006). In the absence of sun-glint (i.e., direct solar beam specular reflection), surface whitecaps and AE, the total signal expressed as TOA reflectance, ρ t (λ), can be simplified as where t is the diffuse transmission (Gordon and Wang, 1994), ρ r is the Rayleigh reflectance in the absence of aerosol, ρ a is the aerosol reflectance, and ρ ar is the radiance arising from Rayleigh-aerosol multiplescattering. Depending on the methodology adopted for AC, the three components within the brackets are computed either as one unknown parameter, i.e., ρ path , or by separating the Rayleigh and aerosol contributions (Antoine and Morel, 1999;Deschamps et al., 1983;Gordon, 1978). The diffuse transmission is also computed knowing the transmissions due to air and gas molecules as well as aerosols (Yang and Gordon, 1997). Among all the unknown components, estimating the aerosol contribution is the most intractable undertaking, and small errors may lead to high uncertainties in ρ w (Gordon and Wang, 1994). Several approaches have been proposed for the removal of atmospheric effects over open oceans for missions like the Moderate Resolution Imaging Spectroradiometer (MODIS), Medium-Spectral Resolution Imaging Spectrometer (MERIS), Global Imager (GLI), and POLarization and Directionality of the Earth's Reflectances (POLDER) (IOCCG, 2010). These approaches differ primarily in the methodology used for the removal of the aerosol contribution (Antoine and Morel, 1999;Chomko and Gordon, 1998;Fukushima et al., 1998;Gao et al., 2000;Gordon, 1997;Gordon et al., 1997;Gordon and Wang, 1994;Harmel and Chami, 2011;Nicolas et al., 2002). This body of research collectively asserts that the performance of the existing AC processors are, in general, acceptable over clear ocean waters, where water-leaving radiance in the near-infrared (NIR) is negligible and maritime (nonabsorbing) aerosols are the dominant aerosol type (IOCCG, 2010).
In both inland and coastal waters, water-leaving radiance in the NIR is often not negligible, and AE due to multiple-scattering from neighboring terrestrial terrain can contribute to ρ path (Bulgarelli et al., 2014;Santer and Schmechtig, 2000;Sterckx et al., 2015). Moses et al. (2017) reported how the removal of atmospheric effects over inland and coastal waters might be further complicated by other factors, such as the proximity to terrestrial sources of aerosols, which result in an optically heterogeneous atmosphere. In coastal waters, terrestrial-and marinesource aerosols can also produce spatially variable and mixed conditions (Pahlevan et al., 2017a) that may not be fully represented in existing aerosol models (Ahmad et al., 2010). Some variants of the open-ocean algorithms attempt to account for non-negligible water-leaving radiances in the NIR in inland and coastal waters (Bailey et al., 2010;Moore et al., 1999;Siegel et al., 2000;Stumpf, 2004). Alternatively, the use of the short-wave infrared (SWIR) bands has been demonstrated to improve retrievals in sediment-dominated waters (Gao et al., 2000;Vanhellemont and Ruddick, 2014;Wang, 2007;Wang and Shi, 2007). However, AC over inland and coastal waters is still an important area of research and development.

Dataset
The radiometric quantity commonly utilized in remote sensing studies is the in situ aquatic reflectance, ρ w , defined as where L w (0 + ) and E d (0 + ) are the water-leaving radiance and downwelling irradiance just above the water surface, respectively, and R rs is remote-sensing reflectance (Mobley 1999). To avoid confusion with nomenclature used in the terrestrial remote sensing community, we adopt the term "aquatic reflectance" for ρ w .
The standard AERONET-OC product (Level 2), the normalized L w (0 + ) corrected for bidirectional effects (Morel et al., 2002), was divided by the solar irradiance spectrum (Thuillier et al., 2003) resampled by 11 nm square filters (Zibordi et al., 2006), and then multiplied by π to yield ρ w . The initial pool of matchups for both missions included > 1200 samples. To account for the difference in the spectral sampling of AERONET-OC and ρ w , we applied the deep neural network approach 1 proposed in Pahlevan et al. (2017d), which converts AERONET-OC ρ w to OLI (443,482,561,and 655 nm) and MSI (443,490,560,and 664 nm) broadband observations. Throughout this research, however, we use MSI band centers to refer to these four visible bands. The spectral band adjustments amounted to <15% change in the magnitude of ρ w at the 490, 560, and 664 nm bands, subject to the shape and magnitude of the spectra (Fig. 3). Note that only <17% of our final AERONET-OC dataset corresponded to inland waters; thus, the associated analysis is primarily representative of coastal regions (Fig. S1).
In order to expand the representation of inland watersinland waters, a total of 2679 hyperspectral ρ w records were compiled in the CVD.
These spectra had been acquired using field radiometers that were assembled and calibrated by six different manufacturers. These instruments were utilized for field measurements following 10 different techniques, all of which are listed in Table S1 and Fig. S2. A visual inspection of ρ w was carried out to detect data showing abnormal spectral features (e.g., noisy spectra, negative values in any spectral bands). This led to the exclusion of less <1% of the initial database. Such a small percentage of excluded spectra was a result of the data quality screening that had already been undertaken by the data providers. No further adjustments (e.g., corrections for the Bidirectional Reflectance Distribution Function; Hlaing et al., 2012) were applied to this dataset. The in situ radiometric data were provided at various spectral resolutions (between 1 and 3.3 nm) and ranges (mostly within the 350-900 nm range). The spectra were convolved with the relative spectral responses of OLI and MSI to obtain band-equivalent ρ w . Here, MSI NIR bands at 705, 743, and 783 nm bands were also simulated. Overall, < 10% of the entire CVD were measured in coastal (e.g., near Crete, Greece) or brackish waters (e.g., Baltic Sea), and the rest represent a diverse range of freshwater ecosystems, from hypereutrophic lakes (e.g., Lake Taihu, China), to oligotrophic lakes (Lake Garda, Italy), and to rivers (e.g., Japurá River, Brazil) ( Table S1). The CVD is thus a significant complement to AERONET-OC. The continental distributions of valid OLI and MSI matchups, following the implementation of the matchup criteria (Section 3.3) are illustrated in Fig. 1.
The assessment of the effect of AC on the quality of downstream products requires matchups accompanied by in situ measurements of water constituents. The AERONET-OC data do not include measurements of Chla and TSS while the CVD has some direct measurements with 123 matchups (Fig. 2). Therefore, for the CVD, we primarily focused on a quantitative analysis (Section 4.2.1) using this subset of Chla and TSS matchups and compared them against Chla and TSS estimated from ρ w (hereafter referred to as Chla r and TSS r ) via best-practice retrieval algorithms (Section 3.5). For the AERONET-OC data, we applied select retrieval algorithms to ρ w to estimate Chla and TSS, hereafter referred to as pseudo in situ estimates (Chla p and TSS p ) (Section 4.2.2). For an ideal AC processor, Chla p and Chla r , for instance, are expected to agree if produced via the same algorithm. It is therefore plausible to assume that the differences in products are attributable to Fig. 1. Locations of valid in situ radiometric matchups acquired near-coincident with Landsat-8 and Sentinel-2 overpasses (see Section 3.3 for more details). These matchups correspond to diverse aquatic ecosystems, including lakes, rivers, and coastal waters (see Tables S1 and S3). The Community Validation Database (CVD) contains data mostly representing inland waters. Background map source: https://www.shadedrelief.com/  Dark target and AOT multi-parameter inversion (area-based) Rayleigh LUT 6SV (Kotchenova et al., 2006) OSOAA (Chami et al., 2015) OSOAA (Chami et al., 2015) SOS (Lenoble et al., 2007) Ahmad and Fraser, 1982 MODTRAN 5.0 (Berk et al., 2006) Other Considerations Geometry Scene center for OLI and 5-km grids for MSI Per-pixel multi-scale Per-pixel Per-pixel Per-pixel Aerosol model Continental/maritime aerosols (Kotchenova et al., 2006) Mixture of fine and coarse modes (Harmel and Chami, 2011) (Shettle and Fenn, 1979) ( Moulin et al., 2001) NA Coastal/Ocean (Ahmad et al., 2010) MODTRAN rural models (Berk et al., 2006) Cloud masking  (Brockmann et al., 2016) Yes (Fan et al., 2021) (continued on next page) individual processors and how they influence ρ w in spectral bands contributing to Chla or TSS retrievals. Moreover, this assessment uses all the valid matchups (N > 400; Section 3.3). The analysis also demonstrates the sensitivity of retrieval algorithms (Table 4) to uncertainties associated with each AC processor, providing insights into the choice of retrieval algorithm for a respective processor. For completeness, this sensitivity assessment was also repeated for the CVD matchups (Appendix C).

AC processors
Eight AC methods were evaluated (Table 2). According to their underlying mechanisms, the processors fall into two broad categories, namely two-step and machine-learning schemes. In the two-step procedure, the effects of Rayleigh and gaseous absorption are first removed and then aerosol contribution is approximated (Eq. 1). ACOLITE and iCOR are the two processors that follow heritage terrestrial approaches for removing aerosol contributions (Vermote et al., 1997) while SeaDAS applies the heritage ocean color approach (Mobley et al., 2016).
POLYMER, on the other hand, fits a second-order polynomial function to ρ rc to simultaneously correct for aerosol and sun-glint signals.
Similarly, GRS applies a spectral fitting approach to the observed ρ path signal to approximate aerosol radiance. The two machine-learning models, C2X and OC-SMART, are both based on multilayer perceptron neural networks, trained with synthetic datasets that were generated using in-water and atmospheric radiative transfer models. MEETC2 utilizes a Bayesian Predictive Classification (BPC) method using Gaussian Mixture Model prior distributions on ρ rc and dissociates the effects of hydrosols and aerosols. Further details on the processors and relevant citations are provided in Table 2.

Matchup selection
While each processor has specific exclusion criteria to mask out clouds and/or haze and handle AE or sun-glint, we further employed an additional masking strategy to retain or remove matchups that could better inform our analysis. Prior to the matchup filtering, all potential matchups were inspected for outliers where either no valid retrievals were derived (e.g., due to clouds) or major disparities between ρ w and ρ w were identified. If the difference between ρ w and ρ w in any band for at least four processors was >100%, the retrieval was flagged as an outlier and the matchup was subsequently discarded from the entire assessment. These two exclusion criteria eliminated 1392 CVD matchups leaving a total of 1287 samples. Among the excluded measurements were in situ measurements in small lakes (e.g., Methods #6 and #9 in Table S1), close to shorelines, or in proximity of man-made objects where AE may have impeded a reliable AC. A similar preliminary cloudmasking led to ~ 600 matchups for the AERONET-OC dataset.
For image data, ρ w were extracted from within 150 m × 150 m square boxes surrounding the matchup location. This is equivalent to 5 × 5-element and 15 × 15-element windows for OLI and MSI, respectively, in the visible bands. For the NIR bands of MSI, this window size represents a 7 × 7-element window (Drusch et al., 2012). An area of approximately 90 m × 90 m centered around the AERONET-OC sites was discarded to minimize contamination from the respective platforms (e. g., lighthouses), where instruments are deployed. Slight differences in the window sizes between OLI and MSI matchups were assumed to introduce negligible uncertainties. The median value of the remaining valid pixels was chosen to best represent ρ w . Additional (conservative) masking criteria (Eq. 3 & 4) were developed (Pahlevan et al., 2014) and modified via experimenting with the available in situ data and visual inspection: where NIR in is the line height at the 865 nm band computed using ρ t at 664 and 1609 nm (Abbott and Letelier, 1999). This index was defined to identify intense phytoplankton blooms and avoid their masking (e.g., Lake Taihu images). Std [ρ t (1610)] was evaluated over the 150 m × 150 m matchup areas to evaluate local spatial variability. For instance, clouds and cloud shadows are expected to result in high standard deviations. The acceptable time difference for matchups was limited to +/− 30 h (Warren et al., 2019) for the CVD matchups and to +/− 3 h for AERONET-OC matchups (Bailey and Werdell, 2006).
Retrievals exhibiting negative values in any band were labeled as failure. Because of frequent negative retrievals in the NIR at the AERONET-OC sites, this criterion was applied only in the visible bands for the respective matchups. For the CVD matchups, this criterion was implemented independently for the visible and NIR bands (see 4.1). Fig. 3 shows the frequency distribution of ρ w (664) derived with POLY-MER, which yielded the largest number of valid matchups for both groups of data (N = 1243; 516 and 727 for the AERONET-OC and CVD matchups, respectively). The number of valid matchups aggregated from both OLI and MSI images is provided for each processor in Table 3. The difference in the number of matchups stems from each processor adopting different theoretical assumptions and applying various thresholding schemes, or practical considerations, for retrievals. The detailed matchup statistics for other processors are also included in Table S3.

Optical water types
To enable a performance analysis per OWT, a classification scheme and set of reference OWTs were chosen. For the classification method, we used the spectral angle mapping technique that takes input spectra normalized by the area under the curve within the range of 400 to 750 nm (Yuhas et al., 1992). A given spectrum is assigned to one of the reference OWTs with which it constructs the minimum cosine angle. Among the various OWT references (Eleveld et al., 2017;Moore et al., 2014;Spyrakos et al., 2018), we found that the specific six and seven OWTs recommended in Eleveld et al. (2017) and Moore et al. (2014), respectively, do not fully encompass the optical variability found across coastal waters. As a result, the 21 OWTs of Spyrakos et al. (2018) were initially considered. The first assignment of ρ w to these OWTs for our valid matchups (Table 3) led to under-represented OWTs that would render the interpretation of the results across a continuum of aquatic environments a challenge. Following a few experimental analyses by adding and removing classes from the original OWT set, a subset of seven OWTs was found to provide both a near-uniform distribution of    (Spyrakos et al., 2018). The number of matchups per OWT for POLYMER is also depicted (Fig. 3). See Table S2 for the correspondence of our OWTs with the original set (Spyrakos et al., 2018). For a detailed description of OWTs see Table 4.
OWTs for each processor's valid matchups and a fair coverage for various OWTs encountered in inland and coastal waters. The correspondence of our selected OWTs with the original set (Spyrakos et al., 2018) is provided in Table S2. Fig. 4. illustrates our selected seven OWTs and the distribution of OWTs, as an example, for the valid matchups of POLYMER. OWTs 1 and 2 are commonly found in the coastal waters and/or oligotrophic lakes whereas OWT3 is attributed to moderately eutrophic waters. Lakes or coastal estuaries with various degrees of phytoplankton blooms are represented by OWTs 4, 5, and 6. Lastly, OWT7 ensures that sediment-rich waters are present in our spectral library. Table 4 provides more specific statistics for the water constituents per OWT. For the OWT-specific evaluation of AC processors (Section 4.1.2), AERONET-OC and CVD matchups were aggregated to permit a statistically robust inference in each OWT. To enable direct intercomparisons among the AC processors for each OWT, we pursued two approaches to identify common matchups: a) a pairwise matchup identification, and b) a universal matchup analysis. The pairwise matchups are determined between two AC processors while universal matchups are determined among multiple processors. The results associated with the universal matchup approach are illustrated in Appendix B (Figs. B1 and B2), but the number of matchups per OWT was considered insufficient. Therefore, we adopted the pairwise intercomparison as the primary method for ranking the AC processors for which CVD and AERONET-OC matchups were combined. The best-performing AC processor per OWT and spectral band was then ascertained by computing win rates (Section 3.6).

Water constituent retrieval
To examine how uncertainties in ρ w propagate to Chla r and TSS r products, two families of algorithms, namely Type I and Type II, were implemented. Type I algorithms are those that utilize the visible bands for retrievals, whereas Type II are those that require both visible and NIR, or only NIR, bands. Type I algorithms were applied to the combined OLI and MSI matchups to enhance the statistical significance of the analysis. For this category, widely used empirical algorithms, including Nechad (Nechad et al., 2010) for TSS and OCx (O'Reilly and Werdell, 2019) for Chla, were utilized (Table 5). For Chla assessments, the recently developed Mixture Density Networks (MDN) model was also employed Smith et al., 2021), while SOLID (Balasubramanian et al., 2020) and Novoa (Novoa et al., 2017) models were also applied for TSS analyses. While the MDN model uses all the available bands across the visible and NIR region, it is sensitive to the most relevant spectral bands (e.g., 490 and 560 nm; Smith et al., 2021). Note that the switching schemes available for SOLID and Novoa in sediment-rich waters were disabled as only OLI-MSI visible bands were utilized. Type II algorithms are only applicable for Chla retrievals from MSI matchups for which 13 different algorithms, including MDN and variations of three-band models (e.g., Moses et al. (2012)), were implemented. For conciseness, outcomes of only the five top-performing models are presented ( Table 7). The formulations for the retrieval algorithms listed in Table 5 are provided in the Equations section of the supplementary material.

Performance metrics
We mainly focused on investigating two metrics representing the overall error and bias in target quantities. These two norms first transform data into log space and then convert them back to linear space to assess the quality of the retrieved quantity (e.g., ρ w ) against that measured in situ (e.g., ρ w ). These metrics are computed as follows: Table 4 Statistics for water constituents associated with each OWT derived by taking weighted averages of data in Spyrakos et al. (2018) where λ i refers to spectral band i, Median is the median operator, β is the symmetric signed percentage bias and ϵ represents the median symmetric accuracy (MdSA; Morley et al. (2018)). These metrics, expressed in %, are simple for interpretation, reasonably resistant to outliers, and zero-centered compared to those in Seegers et al. (2018). In the definition of the above metrics, we assume no uncertainties in insitu measurements, because to a large extent these uncertainties affect all the AC processors equally and were unknown at the time of this research (see Section 5.1). The quantities ρ w and ρ w may be replaced by other quantities, such as Chla r (or TSS r ) and Chla (or TSS), respectively. Where Chla p and TSS p are treated as reference products, the estimated error and bias metrics are denoted ε and β (Section 4.2.).
We also computed the Root Mean Squared Log Error (RMSLE), Root Mean Squared Error (RMSE), Median Absolute Percentage error (MAPE), slope of linear regression in log-space (S), and Mean Symmetric Accuracy (MSA) computed analogous to Eq. 6, except the average of the log ratios was employed to infer the overall impact of noisy retrievals (Seegers et al., 2018). Note that RMSE and MAPE are computed to merely provide traceability to previous studies (Pahlevan et al., 2017b;Pahlevan et al., 2017c;Steinmetz and Ramon, 2018;Warren et al., 2019) and is not anticipated to serve as a reliable measure given the log-normal distribution of our matchup datasets (Fig. 3). In Section 4.1.2, where pairwise intercomparisons are presented, win rates (Seegers et al., 2018) were calculated per OWT and band for each AC pair. A winning processor for each pair, the one with lower ϵ, was assigned with unity. As a result, for each AC pair, a 4 (band) by 7 (OWT)element binary array (Fig. 9) filled with 0 s or 1 s was formed. This pairwise intercomparison was repeated for all the processors to generate seven binary arrays, which were added and normalized by the total number of pairwise comparisons (N=7). Having these normalized heatmaps created for all processors, the AC processors likely to produce the most accurate ρ w for any given band and OWT were determined. The description of the metric used for the universally common matchups is provided in Appendix B.

Fig. 5.
Overall performance of AC processors using the CVD matchups with all the OLI and MSI data combined. Scatterplots are shown in log-log scale and the number of matchups per processor is reported in Table 3. Detailed statistical metrics are available in Table S4. The solid black lines refer to the 1:1 line. Fig. 6. Same as Fig. 5, but for the AERONET-OC matchups. Statistical descriptors are provided in Table S5. The solid black lines refer to the 1:1 line. Fig. 7. Performance assessments as determined by the median symmetric accuracy (ϵ; Eq. 6) and median symmetric bias (β; Eq. 5) for OLI and MSI matchups combined. The dashed lines in the top row correspond to a 30% threshold (GCOS).

Aquatic reflectance products
The performance analyses are described in two main subsections: a) an assessment of all the valid matchups for each processor (Section 4.1.1) and b) an evaluation of common matchups between processors (Section 4.1.2). The latter enables a fair performance intercomparison using identical matchups while the former evaluates each individual processor independently, with an adequate number of matchups, permitting a full assessment of their practicality.

All matchups: Individual performance
To provide a straightforward and qualitative assessment of individual performance, the scatterplots for the CVD and AERONET-OC matchups for OLI and MSI combined are illustrated in Figs. 5 and 6. The full statistical metrics (ϵ, β, MSA, MAPE, RMSE, RMSLE, and S) are reported in Tables S4 and S5. Examining these two figures points to two primary observations. First, each processor performs differently when evaluated against CVD and AERONET-OC matchups, with some performing better in Fig. 8. Similar to Fig. 7, but computed only for the MSI's NIR bands. See Table 3 for the number of matchups. OC-SMART is excluded as it does not retrieve in the NIR bands. The dashed line on the left panel corresponds to a 30% threshold (GCOS). Statistical metrics are provided in Table S4. Fig. 9. Relative performance assessments [%] determined via aggregating pairwise intercomparisons (Section 3.6). Processors with brighter colors (white or yellow) are likely to generate high-quality ρ w for a given OWT and band. CVD and AERONET-OC matchups were combined to carry out this analysis. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) locations represented in the CVD and some showing promise in AERONET-OC regions. Second, the poorest performance consistently corresponds to the blue bands, particularly when CVD matchups are examined. This likely suggests the limited performance of processors to remove aerosol contributions (Pahlevan et al., 2017a). Despite differences in the number of matchups, similarities and discrepancies among the processors can further be highlighted. For instance, SeaDAS and OC-SMART appear to exhibit similar distributions in the 443 and 490 nm bands when assessed using AERONET-OC matchups. For CVD matchups in these bands, POLYMER-derived ρ w tend to saturate for ρ w > 0.03 whereas estimates from iCOR appear to be fairly reliable in this range. Interestingly, for the AERONET-OC matchups, the performance in ρ w (443) degrades for values <~ 0.02 where the data is widely dispersed. This signal-dependent performance is relatively less pronounced in the green and red bands, particularly for the CVD matchups.
A summary of processor performances in terms of retrieval errors (ϵ) and biases (β) in the visible bands is illustrated in Fig. 7. Overall, ϵ further corroborates the limited performance in the blue bands and better performance at AERONET-OC locations compared to those represented in CVD. It is also noticeable that none of the processors appear to uniformly meet the 30% retrieval accuracy requirements (GCOS) across all four bands. The performance is superior in the 560 and 664 nm bands for the CVD matchups and in the 560 nm band for the AERONET-OC matchups. This may partially be attributed to higher signal levels typically found in these spectral regions for the majority of our data. On the other hand, β shows that nearly all the processors underestimate ρ w for the CVD matchups with iCOR and ACOLITE returning minimal biases in the 490, 560, and 664 nm bands. This is opposite to the performance inferred from the AERONET-OC matchups, with SeaDAS exhibiting near-zero biases in the 443, 490, and 560 nm bands. In addition, banddependent performance (improving or degrading from the blue to red) is noted for nearly all the processors when considering CVD matchups. The performance analyses of individual processors for OLI and MSI visible bands are included in Appendix A. The assessment gives insights into the extent to which the processors are dependent on the radiometric performance of the two instruments . The errors and biases in the MSI NIR bands determined via the CVD matchups are provided in Fig. 8. The larger uncertainties are typically found in the 740 or 783 nm bands, and no consistent positive or negative biases are evident across the processors. Assuming a 30% threshold requirement for ρ w (700 < λ < 800 nm), retrievals from ACOLITE and MEETC2 appear to yield valid retrievals for the 705 nm band, although only 78 valid matchups (Table 3) were identified for MEETC2. Further, ϵ tends typically to rise from the 705 to 783 nm band for ACOLITE, C2X, and GRS while the 740 nm band carries larger uncertainties when retrieved from the other processors.

Common matchups: Performance intercomparison
The pairwise intercomparison strategy, on average, resulted in 60 samples for OWTs 1 and 4, 175 for OWTs 2 and 3, 85 for OWT5, and 20 for OWTs 6 and 7, a total of ~600 matchups. The number of matchups was higher for some AC pairs, such as iCOR and POLYMER, while SeaDAS and GRS intercomparisons led to smaller (but adequate) samples (e.g., 150 for OWT3). The heatmaps in Fig. 9 illustrate per band, per OWT performances estimated via win rates (Section 3.6). A paler color in each cell suggests an AC processor with higher chance of producing reliable ρ w products. For OWTs 1 and 2, OC-SMART and SeaDAS perform best with OC-SMART exhibiting a more consistent performance across the four bands. Among the processors, iCOR outperforms the rest of the processors in OWTs 3, 4, and 5, though it was assessed using CVD matchups only. The performance is superior in the blue and green bands in these OWTs. C2X is also found to perform well in OWT3 in the blue and red bands. For OWT6, iCOR together with ACOLITE, C2X, GRS, and MEETC2 appear to perform better than the rest of the schemes. ACOLITE is the obvious choice in OWT7 for which a superior performance is achieved in the 490, 560, and 664 nm bands. For this water type, GRS seems to offer a reasonable performance in particular in the 490 and 560 nm bands. It should also be noted that some of the processors are found to perform consistently across three or all four bands, a signature that underscores a processor's capability to preserve the spectral shape of ρ w . This applies to, for example, iCOR (OWTs 3, 4, 5), SeaDAS (OWTs 1 and 4), MEETC2 (OWTs 5 and 6), or C2X (OWTs 1, 2, and 3) -see Section 5.2 for further discussion. The secondary approach to our OWT analysis conducted using universally common matchups is provided in Appendix B.

Community validation database (CVD)
For the common matchup analysis, we focus only on the AC

Table 6
Errors in Chla r and TSS r estimated through Type I retrieval algorithms using common CVD matchups ( Fig. 2; N = 123). Performance of each retrieval algorithm (δChla p and δTSS p ) is also provided in the 2nd and 3rd columns. Note that the Novoa model is excluded here as it requires ρ w (865), which was not assessed in this exercise. Bestperforming retrieval algorithms per processor are boldfaced.  processors that returned adequate valid matchups to permit statistically robust inference (N=123). The results pertaining to Type I algorithms (Section 3.5) are included in Table 6. The errors (ϵ) in Chla r and TSS r (e. g., δChla r ) as well as in Chla p and TSS p (e.g., δChla p ) are presented to allow for comparisons between the theoritical limits and practical performance of the retrieval algorithms. Considering the MDN retrieval algorithm, δChla r are, on average, four times larger than δChla p . Despite its superior performance in predicting Chla p , MDN does not always outperform OCx for a given ρ w , suggesting its higher sensitivity to uncertainties in AC.
Overall, C2X appears to yield the lowest error (67.4%) and bias (10.9%). Fig. C1 further illustrates Chla r -Chla p plots, allowing to gauge the sensitivity of Type I algorithms to uncertainties in ρ w across a wider concentration range for all the processors. For TSS retrievals, C2X, ACOLITE, and iCOR are found to return more accurate products with ϵ within the 58-63% range, although the biases are appreciably high for ACOLITE and iCOR. The Nechad model outperforms SOLID when applied to ρ w , implying that SOLID is more sensitive to uncertainties in AC. Despite this practical limitation, the iCOR-SOLID combination allows reasonable TSS estimates, which may be attributed to iCOR's ability to better preserve the shape of ρ w as discussed in Section 4.1.2 ( Fig. 9). Moreover, as shown in Balasubramanian et al. (2020), the Fig. 10. Sensitivity of Chla retrieval algorithms (Type I) to uncertainties in ρ w obtained from five AC processors. Chla p were derived from in situ AERONET-OC radiometric matchups using MDNs (top)  and OCx (bottom) (O'Reilly and Werdell, 2019). Chla r were derived using the same algorithms. This analysis is not intended for comparing the absolute performance of MDN or OCx. Note that combined OLI and MSI's visible bands are applied here and that the axes are truncated to 120 mg m − 3 for enhanced visualizations. Number of valid matchups for each processor has been annotated. Fig. 11. Same as Fig. 10, but for TSS. TSS p and TSS r were derived from in situ AERONET-OC and satellite derived spectra via Novoa (top) and SOLID (bottom) models (Balasubramanian et al., 2020;Novoa et al., 2017).
performance of the Nechad model is limited in clear to moderately turbid waters where TSS < 4 g m − 3 (Fig. 2), which is why the slopes are much lower than unity for all the processors. The sensitivity of the Nechad and SOLID models to uncertainties in ρ w for all the processors is provided in Fig. C2.
We further investigated the performances of the five best-performing Chla algorithms (Table 5) using a small subset of universally common MSI matchups (N ~ 19). From the results provided in Table 7, one may infer that although ρ w (705) and ρ w (743) carry relatively large uncertainties (Fig. 8), the critical, but perturbed, spectral information within this wavelength range still results in improvements in Chla r when compared to that shown in Table 6. Further, we note that the propagation of uncertainities works in favor of the family of 2B algorithms such that the estimated errors in Chla r are less than that in Chla p . For instance, when applying Gons, Chla r from C2X shows ~20% error whereas Chla p contains ~39% retrieval errors. However, the 2B models (GU-2B and GI-2B in Table 5) are mostly suited to eutrophic waters (e.g., Chla > 8 mg m − 3 ) and are typically insensitive to the Chla ranges found in oligotrophic lakes or coastal waters. This is in contrast to MDN and Blend which provide valid retrievals across a broader range. An extensive analysis of the sensitivity of Type II algorithms to uncertainties in ρ w computed through all the CVD matchups is illustrated in Fig. C3.

AERONET-OC
For this sensitivity exercise, we present apparent error and bias metrics (ε and β ), as well as slopes of linear regressions (S) in Fig. 10. For an ideal processor that introduces no uncertainties in ρ w , the data distributions are expected to align with the 1:1 line. The sensitivity to uncertainties in ρ w across AERONET-OC sites (with mean Chla value of 2 to 3 mg m − 3 ) varies from 30 to 82%. In general, OCx derived Chla r estimated through SeaDAS and OC-SMART appear to adequately reproduce Chla p suggesting that the blue-green band ratios are preserved fairly well for these two processors. When applying MDN, OC-SMART and POLYMER are shown to contain major biases, which have implications in uncertainties introduced in ρ w across all the visible bands used for MDN estimations. For POLYMER, this also applies to retrievals from the OCx algorithm. We caution readers that this assessment merely points to the sensitivity of the AC processors and should not be interpreted as a verification for the choice of a Chla algorithm. To examine TSS retrievals (with a mean TSS p of 2.6 g m − 3 ), we applied the Novoa and SOLID models (Table 5) since Nechad was found to perform poorly for this range of TSS (>80% with TSS < 4 g m − 3 ) (Balasubramanian et al., 2020). From Fig. 11, one may infer that the Novoa model is less perturbed by the uncertainties introduced via the AC processors, with SeaDAS and OC-SMART introducing the least amount of noise and bias to ρ w . Similar to the Chla analysis, when applying SOLID, OC-SMART tends to add major biases to TSS r . ACOLITE is shown to yield noisy and biased TSS r for both algorithms, while POLYMER appears to introduce noise in SOLID estimates relative to that of Novoa.

In situ radiometry
In situ radiometric measurements acquired by various instruments, field setups and processing procedures added uncertainty to our matchup statistics which remains uncharacterized in this analysis. For example, recent studies have shown that above-water radiometric measurements (Method 7 in Table S1) may carry uncertainties ranging from 5 to 50% across the visible bands depending on environmental conditions . The same measurement technique with well-calibrated instruments was demonstrated to differ from AERONET-OC measurements by up to 10% . Uncertainties associated with other measurement techniques largely vary according to instrument calibration, environmental conditions, measurement protocols, and data processing schemes. Challenges of in-water radiometric measurements may be particularly acute in highly turbid waters or during intense phytoplankton blooms particularly for spectral bands with low penetration depths. In general, compiling a large pool of globally representative matchups, re-used from previous calibration/ validation research (Table S1), was a prerequisite for this effort and thus necessitated acceptance of some underlying uncertainty. We believe that such uncertainties did not alter the comparative assessment of the AC processors and the downstream products for three main reasons. First, use of the median symmetric accuracy metric (ϵ) reduced the impact of any highly uncertain in situ measurements. We carried out a few experiments by varying the matchup filtering criteria (Eqs. (3) & (4)) to evaluate the effect of the number of matchups on the metric. This analysis found no major changes in the apparent performances (e.g., Fig. 7). In contrast, we found the bias metric (β) to be sensitive to changes in the number of matchups, suggesting its sensitivity to outliers, although different matchup samples did not affect whether a processor was generally under-or over-estimating (i.e., bias signs were unaffected). Thus, the majority of the analysis presented was based upon the error metric (ϵ). Second, our measurements were made by hyperspectral radiometers built and/or assembled by different manufacturers, with varying calibration histories and measurement approaches. Creating a large pool of data using a combination of various methods likely minimizes any systematic errors in our performance assessments. Finally, the results from the pairwise common matchup datasets with the CVD and AERONET-OC matchups aggregated seem to converge with those from the individual performance assessments in Section 4.1.1, supporting negligible effects due to potential uncertainties within the CVD.
Regardless, more coordinated data acquisition approaches, such as those offered by AERONET-OC (Zibordi et al., 2009a) or the WATER-HYPERNET (Vansteenwegen et al., 2019) currently under development) will be crucial for better quantifying uncertainties and their sources in space and time. The recently established WISPstation network (Bresciani et al., 2020) is another example of a system with a major contribution to this exercise. Although the WISPstation instrumentation may not meet the highest level of standards required for ocean color validation practices (uncertainties <5%) (Zibordi et al., 2009b), the availability of these measurements extended our valid matchup datasets by ~10%. As automated networks are currently expanding, a harmonization among field radiometric observations will become important to facilitate the evolution of globally applicable AC processors and achieve improved retrieval accuracies (<<30%). Research cruises and field campaigns targeting wide ranges of coastal and inland water types to collect data for calibration/validation will remain essential to increase the environmental diversity of the CVD, and we strongly advocate for their continued support by space or operational agencies. The availability of a web-based platform like the Coastal Thematic Exploitation Platform (CTEP) was instrumental in expediting the coordination, processing, storage, and distribution of the products for matchup assessments. However, for the CVD matchups, a significant amount of time and effort had to be invested to assemble and re-format the in situ radiometric data. Hence, a community curated protocol for data preparation, formatting and ingestion for such analyses would greatly facilitate future AC assessments (IOCCG, 2019). Examples of databases containing in situ radiometric data for inland and coastal waters already exist (e.g., the SeaWiFS Bio-optical Archive and Storage System; SeaBASS, Lake Biooptical Measurements and Matchup Data for Remote Sensing data; LIMNADES), but need to expand. In addition, harmonizing data input from various sensors, including low-cost sensors, into central data systems is currently a subject under development in the MONOCLE project.

Matchup assessments
The performance assessments of AC processors determined using all available matchups, in general, agree well with those from the stricter common-matchup analysis (Sections 4.1.2). For example, both analyses suggest that iCOR and ACOLITE outperform other schemes for the diverse, mostly freshwater observations contained within the CVD or in turbid and/ or eutrophic ecosystems represented by OWTs 3 through 7. In clearer waters (OWTs 1 and 2), OC-SMART and SeaDAS are the best performers. These two processors, along with C2X, also performed well using only AERONET-OC matchups (Fig. B2), confirming their utility for coastal waters. The iCOR processor performed well in OWTs 3, 4, and 5, whereas ACOLITE performed poorly in coastal waters represented in AERONET-OC matchups. An examination of Fig. 9 and Fig. B1 points to POLYMER and MEETC2's potential for producing high-quality ρ w in OWTs 1 and 2, but their overall performance may not exceed those of SeaDAS and OC-SMART. Moreover, Figs. 5 and 7 suggest that iCOR produces better quality ρ w in the 443 nm band than that of other processors, an observation congruous with the analyses offered in Fig. 9 and Fig. B1.
There is also agreement between our assessments of the reflectance and the downstream water-quality products (Sections 4.1 and 4.2). For example, C2X returns ρ w (664) adequately (Figs. 6 and 7) which results in better quality TSS estimates using the Nechad algorithm (Table 6) in water types represented by the AERONET-OC matchups. The Novoa model was also demonstrated to exhibit slight sensitivities to uncertainties in ρ w generated via C2X (Fig. 11). It may therefore be advantageous to seek a fit-for-purpose AC that estimates satisfactory ρ w in only one or two individual bands for specific applications (e.g., TSS estimation using a single-band algorithm). ACOLITE, for instance, performs noticeably well in the 664 nm band for OWTs 4,5, 6, and 7 (Fig. 9); or the blue-green Chla algorithm shows a minimal sensitivity to uncertainties in ρ w derived from SeaDAS in OWTs 1,2,and 4 (Figs. 6,7,9,and 11). Using the CVD matchups and two-band (2B) retrieval algorithms (Table 7), we also demonstrated that C2X, POLYMER, and iCOR yield Chla with accuracies ranging from 25 to 30%, which potentially support their utility in eutrophic ecosystems. These improvements likely stem from counteracting effects of biases in ρ w (705) and ρ w (664) (Fig. 7). In contrast, Chla retrievals using these algorithms with ACO-LITE performed poorly, indicating that the ρ w (705)/ρ w (664) value is adversely affected by this processor.
Overall, the processors like iCOR and ACOLITE that often outperform other schemes for regions represented in CVD achieve 20 to 30% errors in the green and red bands and return Chla r and TSS r estimates with 25-70% uncertainties (Tables 6 and 7). Top-performing processors in coastal waters such as OC-SMART, SeaDAS, and C2X yield errors within the range of 15 to 30% in these spectral bands but the uncertainties in Chla r and TSS r are expected to remain between 50 and 70% given the assessments provided in Table 6 and Figs. 10 and 11. For these processors, the larger errors in Chla r stem from higher sensitivity to errors in ρ w (560) and ρ w (664), the spectral bands that contain the most relevant information for Chla estimation in OWTs 1, 2, and 3. In the blue bands, the errors tend to be larger (25-60%) for best-performing schemes, limiting their utility in scientific studies (see next section). It is also worth noting that the larger sensitivity of MDN and SOLID retrieval algorithms to uncertainties in ρ w are attributable to the fact that these models employ all the available spectral bands instead of a band ratio or a single band. For instance, SOLID uses spectral information in the visible bands to estimate spectral particulate backscattering. This observation likely favors retrieval algorithms that exhibit minimal sensitivities to uncertainties in AC even though they may not work adequately on high-quality ρ w .

Potential pathways towards enhanced performances
We find that the limited success (Fig. 7) in the removal of the contributing aerosol signal remains a major source of uncertainty in ρ w . Regardless of their approach, the AC processors utilize aerosol models that may not accurately represent aerosol types over land and coastal waters. The large inaccuracies (25-70%) in Chla r and TSS r , even from processors that generally meet the 30% threshold requirements for ρ w (560) and ρ w (664) (GCOS), further corroborate the demand for more precise ρ w across all the visible and NIR bands. Should improved downstream products (~ 10% uncertainties) be desired, the requirements on ϵ must be lowered to <10% (Cetinic et al., 2019). We caution readers that the largest uncertainties in ρ w were commonly found for the 443 nm band whose spectral signature makes little contribution to Chla and TSS predictions in most water types examined in this study . This situation applies to MDN and SOLID, but other algorithms considered here do not utilize information in this spectral band. Therefore, a thorough impact assessment of downstream products should encompass a wider array of products (e.g., a cdom ). For instance, given the large uncertainties in the blue bands, retrieving variables like the spectral diffuse attenuation coefficients (K d ) necessary for semi-analytical estimates of Secchi-disk depth (Lee et al., 2016), may be extremely challenging in most inland and coastal waters given the large uncertainties in the blue bands. Another potential reason for the limited performances in the 443 and 490 nm bands is likely the lack of rigor in handling the sky-glint or observations in areas with enhanced backscatter from haze in portions of OLI and MSI images east of nadir. A more robust approach in accounting for the residual signal in the blue may be achieved through the explicit inclusion of sun-sensor geometry, multiple-scattering near the water surface (Gilerson et al., 2018), and cloud fraction information.
On average, the opposite signs in biases in the CVD and AERONET-OC matchup assessments (Fig. 7) for nearly all the processors appear to suggest under-correction and overcorrection in inland and coastal areas, respectively. This reversal pattern rules out significant crosssensor biases between OLI and/or MSI ρ t products. The OLI-and MSIspecific performance analyses (Appendix A) further underscore the sensitivity of each processor to instruments-specific radiometric performance (biases and signal-to-noise-ratio; SNR). For instance, SeaDAS is found to be sensitive to the low SNRs in the MSI 865 and 1610 nm bands employed in approximating aerosol contributions Pahlevan et al., 2017b). The discrepancies found for other processors' performances require further investigations but, in general, a processor capable of producing harmonized ρ w products from OLI and MSI is anticipated to produce similar ϵ, which is minimally sensitive to the number of matchups. Note that this assessment could not be made on an OWT-specific basis (Fig. 9) due to insufficient matchups per OWT for each sensor. Another research area is identifying the source of uncertainties in the 740 and 783 nm bands, which also demand more matchup datasets collected with well-calibrated instruments.
AE continues to be a confounding phenomenon encountered in inland and nearshore coastal waters. It varies as a function of aerosol type, aerosol height, landcover type and its seasonal variability, topography, and other environmental conditions. Hence, geomorphological factors (extent, width, shape) of an ecosystem are not the only determinants for the presence or absence of AE. Interestingly, iCOR is the only processor considered in this study attempting to correct for AE and, overall, the best-performing algorithm across OWTs 3 through 6. Further, in this exercise, all the AC processors frequently failed to retrieve realistic ρ w in small Canadian lakes (e.g., Professor's Lake, a 0.26 km 2 reservoir located in Brampton, Ontario; Huot et al., 2019), whereas acceptable performance was observed in other fairly small ecosystems, such as Elistvere Järv (a 1.29 km 2 lake in Tartu County, Estonia) or tributaries of the Paraná River (Brazil-Paraguay border). Very low water-leaving radiance across the whole visible-NIR range in Canadian lakes dominated by colored dissolved organic matter, combined with a surrounding environment of dense vegetation (e.g., boreal forest) with high NIR reflectance may explain this issue. Focused regional performance assessments will aid in addressing such phenomena. Future global intercomparison exercises may additionally consider relevant environmental, physical, topographical attributes, and their spatiotemporal variability, to improve characterization and correction of AE. For the AERONET-OC sites, we further analyzed the retrieval accuracies with respect to AOT, wind direction, water vapor, and imaging geometry and found very weak or no correlations with ϵ and/or β. With expected improvements in the performance of the AC schemes, higher degrees of correlations with environmental and physical variables are anticipated. Further, dependencies on aerosol types and AOTs may be analyzed using observations at nearby AERONET sites.
Since both C2X and OC-SMART exhibited promising performances, we encourage continued research into novel machine-learning modeling solutions which could capture some of the more subtle influences on retrieval performance mentioned above. These may include improved forward radiative transfer modeling, enhanced representation of aerosols and in-water optical properties, leveraging latest architectures (e.g., activation functions) or models that are adept in identifying better solutions to the inverse problem. Similarly, the prior and posterior probabilities learned and applied by MEETC2 also showed promising results for retrieving ρ w (700 < λ < 800 nm); hence, future investments in applying Bayesian approaches would be worthwhile (Frouin et al., 2019;Thompson et al., 2019).

Recommendations
This international, collaborative effort is one of the first major community-wide research activities to address a critical component of satellite data processing for monitoring and scientific studies of aquatic biogeochemical cycling and water quality at a global scale. Among the objectives of ACIX-Aqua was to make recommendations to the user community, satellite practitioners, water resource managers, as well as space and operational agencies for the implementation of an AC approach suitable for their applications. The present work demonstrates that research in this area is highly topical and needs to continue to achieve a reduction of uncertainties in ρ w and consequently in downstream quantities, such as the inherent optical properties and water constituents (e.g., Chla and TSS). The problem addressed is highly multidimensional, and we discussed uncertainity metrics specific to AC processors, specific to spectral bands of satellite instruments, specific to OWTs and specific to two commonly estimated downstream products. In order to simplify these results and to provide a guideline for end-users to consider when processing or re-processing OLI and MSI images for coastal and inland aquatic applications, we offer the following recommendations and considerations.
• For global studies of inland and coastal waters, there is no single solution, and a preferred AC processor may be chosen according to the specific scientific objective and application. To facilitate this choice, we provide a ranking of AC processors per OWT based on the relative pairwise performance presented in Section 4.2.1 in Table 8. We note that Table 8 is a simplification and higher granularity in the decision process may be achieved by consulting our results with respect to the spectral bands of interest, the desired downstream products, coastal versus inland water applications, or specific satellite sensor (Appendix A). For example, in nearshore coastal waters (OWTs 1, 2, and 3), OC-SMART, SeaDAS, and C2X outperform the rest of the processors (Fig. B2). In these regions, while both SeaDAS and C2X show better performance for the OLI data, OC-SMART is found to perform equally for both OLI and MSI (Fig. A1).
• We demonstrated that each processor has different degrees of sensitivity to varying choices of constituent retrieval algorithms (Section 4.2 and Appendix C may be consulted). This suggests that a switching scheme to select the optimal AC based on OWTs may be a promising approach. Therefore, we recommend further studies dedicated to identifying optimal retrieval algorithms for each processor. • The uncertainties associated with our Chla and TSS matchups demonstrate that a cautious approach is needed where products are intended for scientific studies where high accuracies are required. For example, ϵ < 10% in ρ w is encouraged for the identification of subtle climate-change signals, estimation of the absorption of colored dissolved organic matter (Cao et al., 2018) relevant to carbon budget assessments, and analysis of the variability of in-water particulate backscattering (Zawada et al., 2007). We speculate that this uncertainty requirement should also yield uncertainties <20% in global Chla and TSS products and allow robust assessments of seasonal variability across a wide range of aquatic ecosystems. One should note that this requirement shall not preclude existing products (e.g., Tables 6 and 7) obtained from current versions of the AC processors for global and/or regional water quality monitoring applications. With future hyperspectral missions in sight and their potential ability to better address scientific questions related to phytoplankton properties , more stringent uncertainty requirements in ρ w , similar to those adopted for the Plankton, Aerosol, Clouds, and Ecosystem (PACE) mission (Cetinic et al., 2019) are expected. • While AC processors are evolving, for some of the application areas (e.g., HAB detection) novel regional techniques that fully or partially bypass the AC process and the associated uncertainties are viable alternatives (Binding et al., 2013;Cao et al., 2020;Matthews and Odermatt, 2015;Smith et al., 2021;Stumpf et al., 2016). For a global applicability of such an approach, a further global data drive is needed to compile existing optical proxies of water quality indicators (e.g., Chla, TSS, turbidity, Secchi-disk depth). • Further research dedicated to enhancing the representativeness of aerosol models integrated into the AC processors is required. Future mission designs should consider the inclusion of observation modalities (e.g., polarimetric, hyperspectral, and multi-angular radiometry as well as ranging) to improve the discrimination and/or characterization of aerosol types, heights, and optical thickness (Frouin et al., 2019). Moreover, additions of high-fidelity radiometric measurements in the deeper blue bands (He et al., 2012) and/or within the ultraviolet region should further constrain the solution space for aerosol retrievals, especially, in dystrophic ecosystems (e.g., boreal lakes) where negligible ρ w is expected in this spectral range.

Conclusion
The aquatic subgroup of the second Atmospheric Correction Intercomparison eXercise (ACIX-Aqua), a joint ESA and NASA initiative under the CEOS direction, was specifically developed to carry out a comprehensive assessment of the existing AC processors for Landsat-8 and Sentinel-2 data processing over inland and coastal waters. This required a community-wide data sharing effort from field campaigns in freshwaters around the globe and utilizing the observation records available through the primary coastal ocean AERONET-OC sites.

Table 8
Ranking of AC processors specified using band-average (column-wise) performances in Fig. 9. Note that iCOR was only analyzed using CVD matchups. Order  OWT1  OWT2  OWT3  OWT4  OWT5  OWT6  OWT7   1  OC-SMART  OC-SMART  iCOR  iCOR  iCOR  iCOR  ACOLITE  2  SeaDAS  SeaDAS  C2X  SeaDAS  ACOLITE  MEETC2  GRS  3  C2X  POLYMER  MEETC2  C2X  MEETC2  ACOLITE  iCOR  4  POLYMER  C2X  OC-SMART  OC-SMART  SeaDAS  C2X  MEETC2 Through a considerable effort of in situ data collation, we were able to assess comprehensively the performance of eight different AC processors. We found marked performance differences between inland and coastal waters, where some processors performed satisfactorily in freshwater bodies, while others showed superior performance in coastal waters. Overall, the uncertainties were lower in the coastal environments. By combining our freshwater and AERONET-OC matchups, we produced performance matrices to determine the AC processors considered most capable of generating reliable products for specific OWTs. Despite the processors failing to meet a 30% error threshold across all the visible bands in freshwater environments, we showed (for best-performing AC processors) that the median errors in Chla and TSS range from 25 to 70%. The derived products from best-performing AC processors should be suitable for some water quality monitoring activities, such as hot-spot identification and assessing impacts of episodic events. However, accuracies demonstrated here may limit their suitability for long-term trend studies targeting incremental shifts in climate, biogeochemical cycling, organic and/or inorganic particle discrimination, or phytoplankton property identification. It is anticipated that near-future research will lead to advancements in the performance of AC processors by employing more representative aerosol types and/or bio-optical models depending on the underlying mechanisms of the AC processors. Further, characterizing and quantifying adjacency effects deserve major advancements to ensure practical products for scientific studies or water quality assessments in small or hydrologically complex water bodies.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment
We acknowledge NASA's AERONET team for maintaining the network. Our greatest appreciation is extended to the AERONET-OC Principal Investigators (Giuseppe Zibordi, Brent Holben, Alex Gilerson, Samir Ahmed, Burton Jones, Hui Feng, Young-Je Park, Heidi Sosik, Sherwin Ladner, Timothy Moore, Menghua Wang, Steven Greb, Sarah Bartlett, Dimitry Van der Zande) and their corresponding funding agencies (e.g., JAXA's GCOM-C project). We are thankful to Yannick Huot and Giuseppe Zibordi for their general comments as well as to three anonymous reviewers for providing critical reviews and constructive comments that improved the presentation of the results. The field campaigns conducted in the Brazilian territory were funded by the São Paulo Research Foundation (FAPESP) Project 2014/23903-9. Steef Peters, Evangelos Spyrakos, Peter Hunter, Andrew Tyler, Martin Ligi, and Mark Warren were funded under the European Union's Horizon 2020 research and innovation program under grant agreements No. 776480 (MONOCLE) and No. 730066 (EOMORES). Krista Alikas and Martin Ligi were funded by EOMORES. Evangelos Spyrakos, Peter Hunter, and Andrew Tyler were also funded under the UK Natural Environment Research Council (NERC) projects GloboLakes (NE/ J024279/1) and INCIS-3IVE (NE/L013312/1). BONUS FerryScope for in situ data collection in European waters is also acknowledged. The consistency in image processing and data extraction among all the processors were made possible via the CTEP platform under ESA support. Nima Pahlevan was funded under NASA ROSES contract # 80HQTR19C0015, Remote Sensing of Water Quality element, and the USGS Landsat Science Team Award # 140G0118C0011.

Appendix A
A sensor-specific assessment of the AC processors is shown in Fig. A1, where the number of valid matchups per processor is also provided. For the 443 nm band analyzed via the AERONET-OC matchups, C2X and SeaDAS exhibit distinct performances with higher uncertainties associated with MSI ρ w . Similar differences also exist for ACOLITE and GRS as revealed through CVD matchups. When considering the 664 nm band, C2X, OC-SMART, and iCOR show different performances, i.e., > 50%, for OLI and MSI determined via both CVD and AERONET-OC matchups. Another example is ACOLITE which appears to output relatively consistent ρ w (664) in freshwaters well represented in the CVD matchup pool.ool Fig. A1. Performance assessments demonstrating OLI-MSI interconsistency for the visible bands using valid matchups for each individual processor. Bars associated with the 664 nm chart are labeled with the number of valid matchups for each sensor. The dashed lines correspond to a 30% threshold (GCOS).

Appendix B
As the secondary performance assessment approach for each OWT (Section 3.4), we identified universally common matchups among all the processors that return adequate numbers of matchups for the CVD and AERONET-OC datasets. To rank the processors, a composite metric was formulated using the five (unsigned) log-based figures of merit, i.e., ϵ, |β|, RMSLE, |1 − S|, and MSA (Section 3.6). The metric values for each OWT were stored in a 4 x n array where n is the number of processors evaluated for the four visible bands. These values were then normalized within the [0,1] interval for each OWT and band. This strategy allowed for a uniform scaling of all the metrics. Assuming equal weights for each, the metrics for each processor/OWT/band were added, yielding a value ranging from 0 to 5 -the best-performing AC scheme takes on a value close to zero. This assessment is virtually equivalent of performance analyses based on radar (or spider) diagrams. Note that the analysis has been carried out independently for the CVD and AERONET-OC matchups. Here, only processors returning larger numbers of valid ρ w matchups were evaluated. Therefore, GRS and SeaDAS from the CVD, and GRS from the AERONET-OC analyses, were eliminated. Further analyses suggested that including OC-SMART reduces the number of valid matchups for the CVD assessment; thus, this processor was also eliminated Fig. B1. Relative performance assessments determined via a composite metric derived from the universally common CVD matchups. The number of common and valid matchups is denoted by N. Cooler colors indicate a better performance. Five log-based metrics, i.e., RMSLE, ϵ, |β|, MSA, and |1-S| (Section 3.6), were utilized to produce the heatmaps. Band-average metrics were applied to rank the processors from left to right. Note that no common matchups were identified for OWT6. Fig. B2. Same as Fig. B1, but derived from AERONET-OC data analysis. Less than five common matchups were identified for OWTs 5,6, and 7, and hence they are not illustrated. Fig. C1. Satellite-derived Chla (Chla r ) against pseudo Chla products (Chla p ) estimated from the CVD radiometric matchups. The plots show sensitivity of two (Type I) Chla algorithms to uncertainties in ρ w . Note that combined OLI and MSI's visible bands are applied here, and that the same Chla algorithms are used to create each scatterplot (e.g., MDN for top row). For an ideal AC processor, data distributions align with 1:1 lines. This assessment does not provide insights into the absolute performance of Chla algorithms (see Sections 3.1 and 3.5).

Fig. C2
. Same as C.1, but for TSS products. Top row shows sensitivity of the Nechad model to uncertainties in ρ w derived from different processors. For this algorithm, data with TSS < 4 g m − 3 were removed from the matchups leading to fewer matchup samples. Fig. C3. Same as C.1, but using Type II algorithms for MSI matchups only. The algorithms include MDN , GU-2B (Gurlin et al., 2011), GI-2B (Gilerson et al., 2010), Gons (Gons et al., 2002), and Blend (Smith et al., 2018). GI-2B appears to be least sensitive to uncertainties in ρ w .