Improved Prediction of Dimethyl Sulfide (DMS) Distributions in the NE Subarctic Pacific using Machine Learning Algorithms

Dimethyl sulfide (DMS) is a volatile biogenic gas with the potential to influence regional climate as a source of atmospheric aerosols and cloud condensation nuclei (CCN). The complexity of the oceanic DMS cycle presents a challenge in accurately predicting sea-surface concentrations and sea-air fluxes of this gas. In this study, we applied machine learning methods to model the distribution of DMS in the NE Subarctic Pacific (NESAP), a global DMS hot-spot. Using nearly two decades of ship-based DMS observations, combined with satellite-derived oceanographic data, we constructed ensembles of 1000 machine-learning models using two techniques, random forest regression (RFR) and artificial neural networks (ANN). Our models dramatically improve upon existing statistical DMS models, capturing up to 62% of observed DMS variability in the NESAP and demonstrating notable regional patterns that are associated with mesoscale oceanographic variability. In particular, our results indicate a strong coherence between DMS concentrations, sea surface nitrate (SSN) concentrations, photosynthetically active radiation (PAR) and sea surface height anomalies (SSHA), suggesting that NESAP DMS cycling is primarily influenced by heterogeneous nutrient availability, light-dependent processes and physical mixing. Based on our model output, we derive summertime sea-air flux estimates of 0.5-2.0 Tg S yr⁻¹ in the NESAP. Our work demonstrates a new approach to capturing spatial and temporal patterns in DMS variability, which is likely applicable to other oceanic regions.


Introduction
Dimethyl sulfide (DMS), a volatile biogenic gas, is an important component of the marine sulfur cycle. This molecule contributes the largest fraction of bulk non-sea salt (NSS) sulfate emissions to the atmosphere (Bates et al., 1992), where it is rapidly oxidized to form aerosols that act as cloud condensation nuclei (CCN; Charlson et al., 1987; Hegg et al., 1991; Korhonen et al., 2008), potentially influencing regional albedo and climate (Charlson et al., 1987; Ayers and Cainey, 2007). Given its potential role in climate regulation, and recognized importance to marine microbial metabolism (Vila-Costa et al., 2006) and food web interactions (Nevitt, 2008), substantial research has focused on characterizing DMS dynamics in seawater. This work has revealed considerable complexity in the oceanic DMS cycle, which has limited the development of simple predictive algorithms describing spatial and temporal DMS distributions.
Oceanic DMS production and loss are tightly linked with the biological cycling of the related metabolites dimethyl sulfoniopropionate (DMSP) and dimethyl sulfoxide (DMSO). DMS is believed to be primarily derived from the cleavage of DMSP (Kiene and Linn, 2000), but it can also be cycled through biological DMSO reduction (Spiese et al., 2009) and oxidation (Lidbury et al., 2016), and abiotically by light-dependent reactions (del Valle et al., 2007; Royer et al., 2016). DMS cycling is influenced by a suite of environmental and ecological factors, including release from phytoplankton cells into the dissolved pool via grazing (Dacey and Wakeham, 1986), viral lysis (Malin et al., 1998), or exudation. Oxidative stress generated by other variables such as temperature (Kirst et al., 1991), salinity (Dickson and Kirst, 1987), UV radiation (Kinsey et al., 2016), and nutrient limitation (Bucciarelli et al., 2013; Spiese & Tatarkov, 2014) may also enhance the cycling of DMSP and DMSO, which may regulate DMS concentrations through cascading oxidative pathways (Sunda et al., 2002). Finally, variability in surface wind fields can modulate the rates of DMS sea-air exchange, providing a significant source of heterogeneity in surface water DMS concentrations (Royer et al., 2016). These examples illustrate the complex non-linearity of the oceanic DMS cycle.
Over the past two decades, a number of approaches have been developed to model DMS distributions at both global (Bock et al., 2021; Galí et al., 2018; Simó and Dachs, 2002; Vallina and Simó, 2007) and regional (Watanabe et al., 2007) scales. These models have been largely based on linear regression techniques to estimate DMS using one or two predictors. To date, these studies have focused on a number of variables, including the ratio of chlorophyll a (Chl-a) to mixed layer depth (MLD) (Simó and Dachs, 2002), sea surface temperature (SST) and nitrate (SSN) (Watanabe et al., 2007), solar radiation dose (SRD) (Vallina and Simó, 2007), and photosynthetically active radiation (PAR) and modelled DMSP concentrations (Galí et al., 2018). Some of these models have demonstrated reasonably good performance at global scales, but their predictive power is generally diminished at regional scales (Herr et al., 2019), failing to accurately resolve important smaller-scale features (Belviso et al., 2003; Nemcek et al., 2008; Royer et al., 2015; Tortell, 2005b).
In recent years, machine-learning algorithms have been increasingly used to derive predictions for nonlinear oceanic systems. For example, these methods have been successfully applied to describe the spatial and temporal patterns of global methane flux (Weber et al., 2019) and carbon export (Roshan and DeVries, 2017). To our knowledge, only two studies have thus far applied machine learning to describe DMS distributions, with one study focused on the Arctic (Humphries et al., 2012) and the other exploring a global domain (Wang et al., 2020).
Despite producing algorithms with reasonable predictive skill, these two studies found limited success in resolving the underlying relationships driving DMS variability. This was partially due to a reliance on indirect sensitivity tests assessing the importance of predictor variables, and also, potentially, to the large-scale averaging applied to the underlying data fields (1° × 1°; roughly 111 km in latitude). Analyses at higher spatial resolution may reveal mesoscale (roughly 20-200 km) and sub-mesoscale (roughly 1-20 km) patterns that would otherwise be obscured, thereby increasing predictive strength. (Royer et al., 2014; Saltzman et al., 2009; Tortell, 2005a) has provided marine DMS observations at a significantly higher resolution, yielding greater spatial and temporal data coverage. These new datasets potentially enable new insights into small-scale and regional patterns in oceanic DMS distributions, as well as the characterization of oceanic DMS 'hot-spots'. One such global DMS hot-spot is the northeast subarctic Pacific (NESAP) (Asher et al., 2017; Herr et al., 2019; Lana et al., 2011), a region encompassing both highly productive coastal upwelling regimes and offshore, iron-limited waters (Martin and Fitzwater, 1988). Several factors have been proposed to account for the elevated DMS production in the NESAP, including increased productivity from entrainment and upwelling along coastal fronts (Asher et al., 2017), and the stimulation of DMS production in response to oxidative stress in low-iron waters (Sunda et al., 2002; Herr et al., 2020). Although multiple studies have examined empirical relationships between DMS and various oceanographic factors in the NESAP (Watanabe et al., 2007; Herr et al., 2019; Asher et al., 2011, 2017), these have all reported low predictive skill based on simple linear correlation approaches. To date, machine-learning approaches have not been applied to describe DMS distributions specifically in this region.
Here, we present an approach to modelling summertime NESAP DMS concentrations and sea-air fluxes using ensemble random forest regression (RFR) and artificial neural network (ANN) machine-learning algorithms.
Our statistical models leverage field observations of DMS collected across the NESAP between 1997 and 2017 to generate a summertime DMS climatology mapped at a higher spatial resolution than previous efforts (Simó and Dachs, 2002; Vallina and Simó, 2007; Galí et al., 2018; Watanabe et al., 2007; Humphries et al., 2012; Wang et al., 2020). This new modelling approach represents a significant improvement over previous methods and predicts regional DMS distributions that are coherent with underlying patterns of oceanographic variability. Most notably, the modelled DMS concentrations and sea-air fluxes can be explained, to a large extent, by regional and mesoscale patterns in nutrient supply and physical mixing dynamics. Based on the output of our models, we present summertime sea-air flux estimates in close agreement with previous studies (Herr et al., 2019; Lana et al., 2011), further highlighting the importance of the NESAP as a globally significant sulfur source to the atmosphere.

Data
A combination of data sources was used in training our machine-learning models to build a summertime DMS climatology. For this study, we restricted DMS measurements to the months of June, July and August from 1997 to 2017 in the NESAP (43-60°N, 122-147°W). A total of 26,201 data points were obtained from the NOAA PMEL repository (https://saga.pmel.noaa.gov/dms/; last accessed: February 3, 2021), including measurements derived from purge and trap gas chromatography and membrane inlet mass spectrometry. The DMS data were binned to a monthly resolution, regardless of year, and averaged into 0.25° × 0.25° grid cells.
Predictor data used to build our machine-learning models included the following variables derived from NASA Aqua MODIS Level 3 (L3) monthly products at 0.036° resolution: sea surface temperature (SST), the ratio of normalized fluorescence line height to chlorophyll a (nFLH:Chl-a), instantaneous and daily observed photosynthetically active radiation (iPAR and PAR, respectively), particulate inorganic carbon (PIC), the absorption of gelbstoff and detritus at 443 nm (acdm(443)), and diffuse attenuation coefficients at 490 nm (Kd).
Satellite-based PIC is considered a proxy for the abundance of coccolithophores and other calcified phytoplankton (Franklin et al., 2010), whereas the acdm(443) product is considered a proxy for the distribution of chromophoric dissolved organic matter (CDOM) (Nelson & Siegel, 2013), which is thought to be an important photosensitizer of DMS (see Sect. 4.1). For observations prior to 2004, data were taken from SeaWiFS, or from Terra MODIS when SeaWiFS data were unavailable (e.g. nFLH and iPAR). As described below, Kd and PIC were later excluded from the final models (see Sect. 2.6).
The following predictor variables were also used: 6-day averaged sea surface height anomalies (SSHA) derived from the TOPEX/Poseidon satellites at 0.17° resolution; Level 4 (L4) ESA Sentinel-3 Copernicus monthly-averaged wind speeds at 0.25° resolution; net primary productivity (NPP) from the Vertically-Generalized Production Model (VGPM; Behrenfeld & Falkowski, 1997) at monthly 0.25° resolution; sea surface nitrate from the 2018 World Ocean Atlas at monthly 1° resolution (Garcia et al., 2019); and mixed-layer depth (MLD) and sea surface salinity (SSS) from the MIMOC climatology at 0.5° resolution (Schmidtko et al., 2013). Except for MIMOC data, all predictors were restricted in time to the corresponding years of DMS sampling (1997 to 2017). Net community productivity (NCP) was estimated from the algorithm of Li & Cassar (2016), using NPP and SST. As with DMS observations, predictor data were interpolated to a 0.25° × 0.25° average monthly resolution using linear radial basis function interpolation. Interpolation was constrained to the oceanic region by masking out land pixels using ETOPO2 bathymetry (0.033° resolution) binned at 0.25° × 0.25° resolution. Data sources can be found in Table 1.
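The binning step described above can be sketched as follows. This is a minimal illustration rather than the authors' processing code: the tuple layout of the observations and the function name `bin_monthly` are hypothetical, and only the 0.25° cell size and the "monthly, regardless of year" averaging come from the text.

```python
import math
from collections import defaultdict

CELL_DEG = 0.25  # grid spacing stated in the text


def bin_monthly(observations, cell_deg=CELL_DEG):
    """Average DMS by (month, lat cell, lon cell), ignoring the sampling year.

    `observations` is an iterable of (month, lat, lon, dms_nM) tuples;
    this layout is a hypothetical choice made for the sketch.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for month, lat, lon, dms in observations:
        # Index of the cell containing the point; floor also works for
        # negative (western) longitudes.
        key = (month, math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        sums[key][0] += dms
        sums[key][1] += 1
    # Report each cell by its centre coordinates.
    return {
        (m, (i + 0.5) * cell_deg, (j + 0.5) * cell_deg): total / n
        for (m, i, j), (total, n) in sums.items()
    }


# Made-up observations: two July samples in the same cell, one August sample.
obs = [
    (7, 50.10, -145.05, 4.0),
    (7, 50.20, -145.20, 6.0),
    (8, 50.10, -145.05, 3.0),
]
grid = bin_monthly(obs)
```

Here the two July samples fall in the same 0.25° cell and are averaged together, while the August sample is kept in a separate monthly bin.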

Machine-learning models
We compared the performance of random forest regression (RFR) and artificial neural network (ANN) models at the regional scale. In both cases, the models were built as an ensemble of either 1000 individual decision trees or individual networks to minimize bias in predictions. The input data were randomly divided for use in model training (80%) and external testing (20%). Although RFR is not sensitive to large differences in predictor variance, predictor data were standardized in both models by normalization to their respective mean and standard deviation.
Additionally, we applied an inverse hyperbolic sine (IHS) transformation to the DMS data prior to training. Testing results indicated that IHS yielded slightly better performance than the more traditional logarithmic transformations for both of our models.
Both our ANN and RFR models followed a similar design to Weber et al. (2019). Our ANNs were built using a feed-forward framework consisting of a single input node, two hidden layers each consisting of 30 neurons (using a sigmoidal activation function), and a single output layer (using a linear activation function). A Bayesian L2 (Ridge) regularization parameter was tuned to minimize overfitting. Each individual decision tree within the RFR was trained using the standard CART algorithm and constrained to a maximum depth of 25 decision splits, the simplest configuration determined to perform well and minimize overfitting.
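The two preprocessing steps above (standardization of predictors and the IHS transformation of DMS) can be illustrated with a short sketch. The function names and sample values are hypothetical; the use of the sample standard deviation is an assumption, since the text does not specify which estimator was used.

```python
import math
import statistics


def ihs(x):
    """Inverse hyperbolic sine, asinh(x) = ln(x + sqrt(x^2 + 1))."""
    return math.asinh(x)


def inverse_ihs(y):
    """Back-transform model output to concentration units."""
    return math.sinh(y)


def standardize(values):
    """Z-score: subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD; assumed, not stated in the text
    return [(v - mean) / sd for v in values]


dms_nM = [0.3, 2.0, 4.7, 84.3]           # illustrative concentrations
transformed = [ihs(v) for v in dms_nM]   # values used for training
recovered = [inverse_ihs(t) for t in transformed]
sst_z = standardize([8.0, 10.0, 12.0, 14.0])
```

Unlike a logarithm, the IHS is defined at zero and behaves approximately linearly near it, while still compressing the high-concentration tail.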
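To make the network layout concrete, the sketch below runs a forward pass through a feed-forward network with two sigmoidal hidden layers of 30 neurons and a linear output, then averages an ensemble of such networks. Several things here are assumptions: the nine predictors are taken as the network inputs, the weights are random rather than trained, the ensemble is aggregated by a simple mean, and only 10 members are built instead of 1000 to keep the example fast. Regularization and training are omitted entirely.

```python
import math
import random

N_PREDICTORS = 9   # the text refers to nine predictors
N_HIDDEN = 30      # neurons per hidden layer, as stated


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases; a trained model would fit these."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, layer, activation):
    """Apply one fully connected layer followed by its activation."""
    weights, biases = layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def make_network(seed):
    rng = random.Random(seed)
    return [init_layer(N_PREDICTORS, N_HIDDEN, rng),
            init_layer(N_HIDDEN, N_HIDDEN, rng),
            init_layer(N_HIDDEN, 1, rng)]


def predict(network, predictors):
    h1 = dense(predictors, network[0], sigmoid)
    h2 = dense(h1, network[1], sigmoid)
    return dense(h2, network[2], lambda z: z)[0]  # linear output


def ensemble_predict(networks, predictors):
    """Mean over ensemble members (an assumed aggregation rule)."""
    preds = [predict(net, predictors) for net in networks]
    return sum(preds) / len(preds)


ensemble = [make_network(seed) for seed in range(10)]
x = [0.0] * N_PREDICTORS           # standardized predictors at their means
y_hat = ensemble_predict(ensemble, x)
```

Because the target was IHS-transformed before training, a real prediction would be back-transformed with sinh before being reported in nM.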

Sea-to-air fluxes
Sea-air DMS fluxes ($F_{DMS}$, µmol m⁻² d⁻¹) were calculated from the monthly-averaged observed and modelled DMS values for June, July and August. $F_{DMS}$ was calculated using the gas transfer velocity ($k$, cm hr⁻¹) following the modified approach of Webb et al. (2019):
$$F_{DMS} = 0.24 \, k \, [DMS] \qquad (1)$$
where the factor of 0.24 converts the values to daily fluxes. Since our fluxes were calculated from our monthly-averaged models, the gas transfer velocity was calculated using the approach from Simó & Dachs (2002), as modified by Nightingale et al. (2000). This approach is necessary to correct for differences due to the non-linear relationship between DMS and wind speed (Livingstone and Imboden, 1993) when using monthly-averaged, satellite-derived wind speeds. Assuming a Rayleigh distribution ($\xi = 2$), $k$ is defined as a function of $\eta$ and $Sc_{DMS}$ (Eq. 2), where $\eta$ is the quotient of the wind speed (m s⁻¹) by the gamma function $\Gamma(s)$ (using $s = 1 + 1/\xi$), and $Sc_{DMS}$ is the DMS-specific Schmidt number as defined by Saltzman et al. (1993):
$$Sc_{DMS} = 2674 - 147.12\,(SST) + 3.72\,(SST^2) - 0.038\,(SST^3) \qquad (3)$$
Regional summertime fluxes ($\bar{F}_{DMS}$, Tg) were calculated as the average quantity of DMS-sulfur emitted over 92 days through the area of the mapped study region (1.28 × 10⁷ km² or 85.0% of the total bounded area).
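The pieces of the flux calculation that are fully specified in the text can be sketched in a few lines: the Schmidt number polynomial (Eq. 3), the wind-speed term η for a Rayleigh distribution, and the unit conversion from k and DMS to a daily flux. The function names are hypothetical, and the gas transfer velocity itself is passed in as an input rather than computed, because its parameterization is not reproduced here. The sketch assumes k in cm hr⁻¹, DMS in nM and SST in °C.

```python
import math


def schmidt_dms(sst_c):
    """DMS Schmidt number as a cubic polynomial in SST (Eq. 3)."""
    return 2674.0 - 147.12 * sst_c + 3.72 * sst_c**2 - 0.038 * sst_c**3


def rayleigh_eta(mean_wind_m_s, xi=2.0):
    """Wind speed divided by Gamma(1 + 1/xi); xi = 2 is the Rayleigh case."""
    return mean_wind_m_s / math.gamma(1.0 + 1.0 / xi)


def sea_air_flux(k_cm_per_hr, dms_nM):
    """Flux in umol m^-2 d^-1.

    0.24 = 24 h d^-1 / 100 cm m^-1, and 1 nM equals 1 umol m^-3.
    """
    return 0.24 * k_cm_per_hr * dms_nM


sc_10 = schmidt_dms(10.0)           # Schmidt number at 10 degrees C
eta_8 = rayleigh_eta(8.0)           # eta for an 8 m/s monthly mean wind
flux = sea_air_flux(10.0, 5.0)      # illustrative k = 10 cm/hr, DMS = 5 nM
```

With these illustrative inputs, a transfer velocity of 10 cm hr⁻¹ (2.4 m d⁻¹) acting on 5 nM of DMS gives a flux of 12 µmol m⁻² d⁻¹.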

Comparison against existing algorithms
Simple linear regression (LR) and multiple linear regression (MLR) models were built for comparison against the machine-learning algorithms. We also tested the performance of our RFR and ANN models against the published algorithms of Simó & Dachs (2002), Watanabe et al. (2007), Vallina & Simó (2007), and Galí et al. (2018) (hereafter referred to as SD02, W07, VS07, and G18, respectively). SRD is calculated here using MLD as described by Vallina & Simó (2007). Each of the four algorithms was assessed using both their original coefficients and coefficients tuned to our NESAP dataset using nonlinear least-squares optimization.
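As a rough illustration of how a solar radiation dose depends on MLD, the sketch below uses the form in which SRD is commonly written: surface irradiance averaged over a mixed layer with exponential light attenuation. This expression, the function name and the sample values are assumptions made for illustration and are not taken from this text; the attenuation coefficient is left as a required argument for the same reason.

```python
import math


def solar_radiation_dose(surface_irradiance, attenuation_per_m, mld_m):
    """Mean irradiance over the mixed layer for exponential attenuation.

    SRD = I0 * (1 - exp(-k * MLD)) / (k * MLD)   (assumed form)
    """
    kz = attenuation_per_m * mld_m
    # -expm1(-kz) equals 1 - exp(-kz) but stays accurate for small kz.
    return surface_irradiance * (-math.expm1(-kz)) / kz


# Illustrative values only: the same surface irradiance and attenuation,
# with a shallow and a deep mixed layer.
shallow = solar_radiation_dose(200.0, 0.06, 20.0)
deep = solar_radiation_dose(200.0, 0.06, 60.0)
```

Under this form, a shallower mixed layer yields a higher dose for the same surface irradiance, which is the qualitative behaviour a dose-based DMS algorithm relies on.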

Controls on DMS variability
Principal component analysis (PCA) was applied to assess the relationships between DMS and the nine predictors used to build the RFR and ANN ensembles. Additionally, non-parametric Spearman rank correlations were calculated between each variable and both the modelled and observed DMS concentrations. Correlation analysis was also extended to assess the role of taxonomy on predicted DMS concentrations, using the outputs of a chlorophyll-a based taxonomic algorithm by Hirata et al. (2011) with NESAP-tuned coefficients (Zeng et al., 2018).
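For readers unfamiliar with the rank correlation used here, the following sketch computes Spearman's ρ as the Pearson correlation of average ranks, which handles tied values. The helper names and the sample data are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values: a monotonic but non-linear relationship.
ssha = [1.0, 2.0, 3.0, 4.0]
dms = [1.0, 4.0, 9.0, 16.0]
rho = spearman_rho(ssha, dms)
```

Because only ranks enter the calculation, any monotonic relationship gives |ρ| = 1 regardless of its shape, which suits the skewed DMS distribution.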

Sensitivity Tests and Predictor Selection
To inform our selection of grid size, we assessed the performance of both the RFR and ANN models using grid cells ranging from 0.25° to 5° (Fig. 1). From this analysis, we found that model accuracy was highest at 0.25° resolution (see Sect. 3.1). Smaller grid sizes would presumably further improve model accuracy, but at a significantly higher computational cost. We also tested the influence of other biological predictor variables on the performance of the RFR and ANN models, using either NCP, NPP, Chl-a, or PIC. These sensitivity tests indicated no significant difference between the various biological predictor variables, although accuracy was slightly reduced when PIC was used.
We therefore selected NCP as the biological predictor variable within our model framework.We also removed Kd as a predictor variable after further sensitivity testing indicated that its exclusion slightly improved results.

Model evaluation
To benchmark the performance of our RFR and ANN models, we first evaluated the predictive skill of four existing empirical DMS algorithms (SD02, W07, VS07, and G18), in addition to simple and multiple linear regression models. Previous studies have demonstrated that these empirical algorithms show strong predictive skill (R² = 0.53-0.84) over large scales and in some oceanic regions (Simó and Dachs, 2002; Galí et al., 2018; Watanabe et al., 2007), but significantly poorer performance in the NESAP (Herr et al., 2019). Consistent with these results, we found that SD02, W07, VS07, and G18 did not accurately predict NESAP DMS distributions, even with regionally tuned coefficients (Fig. 2, R² = 0-0.01). We also found that simple and multiple linear regressions performed poorly (R² = 0-0.05; Fig. 2, 3), yielding virtually no explanatory power for surface water DMS distributions in the NESAP. Relative to other published modelling approaches, both the RFR and ANN models dramatically improved the representation of NESAP DMS variability, achieving significantly higher predictive accuracy (Fig. 2, 3). The collective ensembles of both the RFR and ANN models yielded strong performance, explaining up to 62% of the observed DMS variability (R² = 0.61-0.62; Fig. 3). For individual models within the ensembles, the ANN method provided slightly better results (R² = 0.16-0.50) compared to the individual RFR models (R² = 0.16-0.43). As observed for predicted DMS concentrations, the models showed lower predictive power for sea-air DMS fluxes at coarser resolution (Fig. 1).
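The skill metric quoted throughout this section can be illustrated with a short sketch. It uses the coefficient of determination, 1 − SS_res/SS_tot; whether the values reported here follow this definition or the squared correlation coefficient is not stated, so the choice is an assumption. The data are made up.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(observed, observed)     # model reproduces every value
mean_only = r_squared(observed, [2.5] * 4)  # model predicts only the mean
```

A model that predicts nothing but the observed mean scores zero under this definition, which is the sense in which the linear and tuned empirical algorithms above carry virtually no explanatory power.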

DMS distributions and sea-air fluxes
The predicted spatial distribution of DMS was generally consistent between observations and the RFR and ANN methods (Fig. 4a,c,d). The average model-derived DMS concentrations were 4.0 ± 2.1 nM and 4.7 ± 3.0 nM (mean ± SD) for the RFR and ANN ensemble models, respectively, with a similar range from 0.3 to 84.3 nM. In both models, the highest DMS concentrations were largely constrained to coastlines and within the Alaska Gyre adjacent to the Aleutian Islands (Fig. 4b-c, 8c). The greatest discrepancy between DMS concentrations from the two models was observed in these regional 'hot-spots', where the ANN models emphasize high DMS within the Alaska Gyre, while the RFR models emphasize elevated coastal DMS concentrations (Fig. 4b). The models deviated on average by 0.49 nM, with the greatest offsets observed in an area of particularly sparse DMS observations in the Alaska Gyre (Fig. 4a,b). Future observational data in this region should help improve model performance. Sea-air DMS fluxes (Fig. 4e,f) derived from ANN predictions were 18% higher, on average, than RFR predictions, largely due to higher predicted values in the Alaska Gyre (Fig. 4d-e, Table 2). The distribution of ANN sea-air fluxes was also closer to ship-based observations (Fig. 5). Predicted regional fluxes ranged from 0.7 to 107 µmol m⁻² d⁻¹ between the two models (Fig. 4e,f, 5), with the highest predicted DMS emissions in August, when derived sea-air fluxes were approximately 1.5-fold greater than in June and July (Table 2). Our models yielded a summertime integrated sea-air flux of 0.31 ± 0.19 Tg DMS-derived sulfur (equivalent to 0.5 to 2.0 Tg S yr⁻¹; Table 2), in good agreement with recent estimates based on compiled ship-based observations (0.3 Tg; Herr et al., 2019) and existing climatological estimates (Table 2; Lana et al., 2011). This summertime mean value is equivalent to ~4-8% of total global DMS sea-air emissions annually, assuming an uncertainty ranging between 15 and 28 Tg S yr⁻¹ in global estimates (Bock et al., 2021). This result further emphasizes the NESAP as a globally significant DMS source to the atmosphere.

                                   Tg S           Tg S
June    5.9 ± 3.7    6.0 ± 3.9    0.22 ± 0.13    0.44 ± 0.20
July    6.5 ± 3.0    7.7 ± 3.8    0.26 ± 0.12    0.33 ± 0.17
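The arithmetic linking a mean flux to a regional sulfur total, and a 92-day total to an annualized rate, can be sketched as follows. This is one plausible reading of the calculation rather than the authors' code: it applies a single regional mean flux to the whole mapped area instead of summing over grid cells, and the function names and the illustrative mean flux are hypothetical. The area, the 92-day period and the 0.31 ± 0.19 Tg total come from the text; the molar mass of sulfur is a standard value.

```python
AREA_KM2 = 1.28e7      # mapped study area stated in the text
DAYS = 92              # June through August
S_MOLAR_MASS = 32.06   # g mol^-1; one mole of DMS carries one mole of S


def regional_flux_tg_s(mean_flux_umol_m2_d, area_km2=AREA_KM2, days=DAYS):
    """Convert a mean flux (umol m^-2 d^-1) to Tg of sulfur over the period."""
    area_m2 = area_km2 * 1e6
    mol_s = mean_flux_umol_m2_d * 1e-6 * area_m2 * days
    return mol_s * S_MOLAR_MASS / 1e12  # grams to teragrams


def annualize(summer_tg, days=DAYS):
    """Scale a summertime total to an equivalent annual rate."""
    return summer_tg * 365.0 / days


summer_total = regional_flux_tg_s(5.9)   # illustrative mean flux
annual_mean = annualize(0.31)            # about 1.2 Tg S per year
annual_low = annualize(0.31 - 0.19)      # about 0.5 Tg S per year
annual_high = annualize(0.31 + 0.19)     # about 2.0 Tg S per year
share_low = annual_mean / 28.0           # fraction of the upper global estimate
share_high = annual_mean / 15.0          # fraction of the lower global estimate
```

Scaling the 0.31 ± 0.19 Tg summertime total by 365/92 gives roughly 0.5 to 2.0 Tg S yr⁻¹, and dividing the annualized mean by the 15 to 28 Tg S yr⁻¹ global range gives roughly 4-8%, consistent with the figures quoted above.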

Drivers of DMS variability
In addition to modelling the spatial and temporal distribution of surface water DMS in the NESAP, we examined the influence of different oceanographic variables as model predictors. As expected based on previous work (Herr et al., 2019), no single predictor was found to exert a dominant control on modelled DMS distributions from either the RFR or ANN models (Fig. 6, 7). Rather, the relationship between DMS and other oceanographic variables exhibited significant region-specific patterns. One of the most compelling regional signatures was the apparent relationship between DMS and SSHA. In both models, we found significant positive correlations between DMS and SSHA (ρ = 0.35 and 0.41 for RFR and ANN, respectively) across the full spatial domain, with a particularly notable relationship along the northern Alaskan coastline (Fig. 8, 9). Here, strong winds (Fig. 9j-l), coupled with the northeastern Alaska current flow, produce two characteristic oceanographic features in the NESAP: strong, semi-permanent mesoscale eddies collectively referred to as the Haida, Sitka and Yakutat eddies (Fig. 8a), and the formation of the high nutrient, low chlorophyll (HNLC) Alaska Gyre (Fig. 8c; Okkonen et al., 2001; Whitney et al., 2005). Both the monthly (Fig. 9a-i) and summertime-averaged (Fig. 8a
Other variables appear to exhibit a more localized or minimal influence on DMS cycling. For instance, both NCP and DMS are elevated in productive nearshore waters, but NCP generally correlates weakly with both RFR- and ANN-derived DMS concentrations (ρ = 0.08 and 0.09 for RFR and ANN, respectively). Similarly, modelled phytoplankton taxonomic composition (Hirata et al., 2011; Zeng et al., 2018) was not significantly correlated with predicted DMS concentrations (ρ < 0.1). Although strong, persistent winds appear to sustain low DMS concentrations off the coast of Oregon and Vancouver Island (Fig. 9), wind speeds only weakly correlate with DMS overall for the region (ρ = -0.15 and -0.12 for RFR and ANN, respectively). Additionally, high PAR in these areas corresponds with low DMS concentrations (Fig. 6d) and there is an overall negative correlation between PAR and DMS for the region (Fig. 6, 7; ρ = -0.21 and -0.29 for RFR and ANN, respectively). Finally, despite hypothesized links between DMS cycling and iron limitation in the NESAP (Levasseur et al., 2006; Merzouk et al., 2006), nFLH:Chl-a ratios (taken as a proxy for phytoplankton iron stress; Behrenfeld et al., 2009; Westberry et al., 2013) did not exhibit any coherent spatial patterns and only weakly correlated with our modelled DMS concentrations (ρ = 0.15 for both RFR and ANN).

Discussion
The relative sparsity of DMS data in many oceanic regions and the complexity of DMS cycling have limited previous attempts to model oceanic distributions of this compound (Simó and Dachs, 2002; Vallina and Simó, 2007; Galí et al., 2018; Watanabe et al., 2007; Herr et al., 2019). Taking advantage of expanding data resources, we employed a new approach to statistically describe DMS distributions in the NESAP. Our results show that both our RFR and ANN models substantially improved predictive strength over traditional empirical approaches (Fig. 2, 3), while identifying several key DMS relationships and regional patterns across the NESAP (Fig. 8, 9). Although our statistical approach does not directly elucidate the underlying mechanisms driving these relationships, we can nonetheless make some reasonable inductive inferences. These inferences are discussed below, along with the implications of the improved predictive performance observed here.

Relationships with other oceanographic variables
Among the more prominent spatial relationships we observed was the coherence between predicted DMS concentrations and SST, and the negative correlation between predicted DMS concentrations and sea surface nitrate (SSN) within and surrounding the Alaska Gyre (Fig. 6-9). The DMS-nitrate relationship may be partially explained by the so-called sulfur overflow hypothesis (Stefels, 2000), which suggests that nutrient-limited phytoplankton increase DMSP production and its subsequent cleavage to DMS in order to regulate intracellular sulfur quotas when protein synthesis is limited (Hatton & Wilson, 2007; Kinsey et al., 2016; Simó & Vila-Costa, 2006; Spiese & Tatarkov, 2014; Stefels, 2000). This pathway may help explain the higher DMS concentrations predicted at the northern extent of the Alaska Gyre, where SSN concentrations begin to decrease (Fig. 6). The apparent relationship between DMS and nitrate could also result indirectly from the underlying effects of iron limitation. Excess summertime nitrate concentrations are taken as evidence for iron limitation in the NESAP (Boyd and Harrison, 1999; Boyd et al., 2004; Martin and Fitzwater, 1988; Whitney et al., 2005). Under iron-limiting conditions, DMS is thought to function, together with DMSP and DMSO, as part of an antioxidant response to oxidative stress (Sunda et al., 2002). This hypothesis suggests that iron limitation should stimulate net production of DMS and DMSP (Bucciarelli et al., 2013; Sunda et al., 2002), which is inconsistent with the negative dependence predicted between DMS and SSN (Fig. 8b,c).
Satellite-based, chlorophyll-normalized fluorescence has been suggested as an additional proxy for iron limitation. Low iron conditions can lead to both a reduction in photosystem I relative to photosystem II (Strzepek and Harrison, 2004), and an apparent increase in energetically-decoupled light harvesting complexes (Allen et al., 2008; Behrenfeld & Milligan, 2013), resulting in elevated fluorescence-to-chlorophyll a ratios (nFLH:Chl-a) (Westberry et al., 2013). To our knowledge, this proxy has not been widely investigated with respect to DMS cycling. In our analysis, we found that nFLH:Chl-a ratios, and the NPQ-corrected fluorescence yields (φf), exhibited only weak positive correlations with the RFR and ANN predicted DMS concentrations (Fig. 6, 7). Moreover, neither of these metrics exhibited coherent spatial patterns with predicted DMS concentrations, suggesting a limited role for iron in driving spatial patterns of DMS cycling within the NESAP. However, it is important to note the potential temporal mismatch between our monthly DMS predictions and these more instantaneous metrics of iron limitation, which reflect short-term physiological changes (days to weeks; Behrenfeld et al., 2009; Westberry et al., 2019) that depend on sporadic iron loading (e.g. aerosol deposition; Mahowald et al., 2009). Indeed, both natural and artificial iron-fertilization events have thus far been detected from satellite-derived nFLH:Chl-a at daily resolution (Westberry et al., 2013), in contrast to the monthly-averaged data used here. Therefore, modelling frameworks utilizing shorter temporal scales may find a clearer connection between DMS cycling and iron limitation using the chlorophyll-a fluorescence proxy.
Beyond nutrient limitation effects, ambient light fields are believed to exert significant direct and indirect effects on DMS cycling (del Valle et al., 2007). Ultraviolet radiation has been noted to induce high DMS production and turnover through a proposed cascading oxidation pathway, which acts to remove harmful reactive oxygen species (Sunda et al., 2002; Archer et al., 2010). In contrast, more recent evidence has indicated the potential for elevated DMS production in the NESAP from the reduction of DMSO due to light-induced oxidative stress over diurnal cycles (Herr et al., 2020). However, our modelled DMS concentrations exhibited a negative correlation with PAR (Fig. 6, 7), suggesting that incident light may predominantly drive DMS loss in the NESAP through photolysis (del Valle et al., 2007) on regional and longer-term scales.
Since DMS does not have strong light absorption properties, the presence of photosensitisers is necessary for the abiotic photooxidation of DMS (Brimblecombe and Shooter, 1986). To account for this process, our models incorporated nitrate (SSN) and acdm(443) (as a proxy for CDOM; Nelson & Siegel, 2013), both of which are thought to be dominant photosensitisers of DMS in marine systems (Taalba et al., 2013; Bouillon and Miller, 2004, 2005; Galí et al., 2016). In the NESAP, nitrate appears to exert a stronger influence than CDOM on the apparent quantum yields (AQY) of DMS (Bouillon and Miller, 2004). In support of this, our results suggest a stronger negative dependence of predicted DMS concentrations on nitrate compared to CDOM within the NESAP (Fig. 6, 7). We note, however, that the DMS-nitrate relationship likely also reflects physiological impacts of nutrient limitation, as discussed above. Nonetheless, our results are consistent with elevated rates of DMS photo-oxidation in the nitrate-replete, low-iron waters of the Alaska Gyre, where photolysis, coupled with potentially high DMS oxidation rates due to iron-induced oxidative stress (Sunda et al., 2002), may explain the low predicted DMS concentrations (Fig. 8, 9). Further in situ work will be required to resolve the relative contributions of these biotic and abiotic processes to DMS cycling within these areas.
Among all the statistical relationships we observed, perhaps the most striking was the association of DMS variability with SSHA, particularly along the Alaskan coast and in relation to mesoscale eddies (Okkonen et al., 2001; Whitney et al., 2005; Fig. 8, 9). To our knowledge, only one other study has linked SSHA to DMS within the NESAP. Herr et al., (2019)  physical mixing processes are important. For example, enhanced biological production is known to be stimulated by eddy re-supply of iron and macronutrients via vertical advection and diffusion (Whitney et al., 2005; Bailey et al., 2008). These nutrient supply processes would also be expected to influence DMS cycling, as outlined above.
Elevated abundances of high DMS-producers within eddies have been noted in the Sargasso Sea (Bailey et al., 2008), while eddy-induced vertical transport likely supplements nearshore, current-driven upwelling that can also resupply iron into the coastal waters of the NESAP (Cullen et al., 2009; Freeland et al., 1984). In addition, eddy propagation can allow cross-shelf transport, distributing micronutrients to offshore waters (Fiechter and Moore, 2012), potentially contributing to the apparent elevated DMS concentrations in the outer Alaska Gyre between the 10.5 and 12 °C isotherms (Fig. 8). These mixing and transport mechanisms could partially explain the influence of elevated productivity in driving increased nearshore and northern NESAP DMS concentrations (Fig. 4, 7-9), representing a novel source of DMS variability in this region.
The taxonomic composition of plankton assemblages is also a likely source of variability influencing DMS cycling. Significant changes to DMS production and consumption rates within the NESAP are expected in response to variable microbial and phytoplankton taxonomy (Vila-Costa et al., 2006; Lidbury et al., 2016; Sheehan and Petrou, 2020). Such taxonomic variability may, in turn, reflect transient community composition shifts in response to mixing (Bailey et al., 2008), nitrate (Bouillon and Miller, 2004), and iron availability (Levasseur et al., 2006; Merzouk et al., 2006). The monthly averaging used in our data processing removes autocorrelation associated with individual sampling expeditions (Wang et al., 2020), but it may preclude capturing these transient taxonomic responses. For instance, coccolithophores have long been believed to influence DMS cycling in the NESAP (Herr et al., 2019; Asher et al., 2011), yet averaged calcite distributions did not yield increased predictive strength for DMS concentrations in our analysis (see Sect. 2.6). Similarly, applying a chlorophyll-a based taxonomic algorithm (Hirata et al., 2011; Zeng et al., 2018) yielded no further explanation of the predicted DMS variability. The influence of taxonomic composition thus remains cryptic within our modelling framework.

Implications of Improved Predictive Power
As noted above, both the RFR and ANN approaches demonstrate significantly improved accuracy, explaining up to 62% of observed DMS variability (Fig. 2, 3). This model performance is somewhat lower than that achieved in the prediction of methane fluxes (Weber et al., 2019) and dissolved inorganic carbon dynamics (Roshan and DeVries, 2017), where R² values ranging from 0.7 to 0.95 were obtained. Nonetheless, the dramatic accuracy improvement of our algorithms over traditional methods (Fig. 2, 3) encourages the further use of these techniques in modelling DMS distributions. Improved predictive accuracy provides opportunities to gain insight into the mechanisms driving DMS cycling.
Our approach has yielded accurate DMS predictions at a 4 to 40-fold higher resolution than previous algorithms (Simó and Dachs, 2002; Vallina and Simó, 2007; Galí et al., 2018; Watanabe et al., 2007), enabling the description of mesoscale patterns and processes (Fig. 8). Extending these methods to sub-mesoscale resolution will enable investigations into the dependence of DMS on finer-scale hydrographic processes, particularly stratification and frontal dynamics, which have been increasingly linked to DMS cycling but remain unresolved mechanistically (Asher et al., 2011; Royer et al., 2015). Moreover, coupling machine learning algorithms with biophysical and tracer export models holds promise to resolve the contributions of eddy dynamics and upwelling intensity to DMS variability, likely through nutrient availability and physiological mechanisms (Asher et al., 2011; Bailey et al., 2008; Cullen et al., 2009). Recent work has also developed a new database of DMS apparent quantum yields (Galí et al., 2016). As the availability of these measurements increases, simultaneous mapping of both DMS quantum yields and concentrations will become feasible, enabling future studies to better parse out the contributions of photolysis, physical mixing, and biological drivers to DMS cycling.
Although used in a diagnostic capacity here, our statistical models also hold potential for prognostic applications. Frameworks utilizing shorter time scales will likely be able to capture observed diel cycling (Galí et al., 2013; Royer et al., 2016), even if the underlying mechanisms are still unresolved. We note, however, that caution will need to be exercised, as machine learning models have a tendency to overfit noise (Weber et al., 2019; Roshan and DeVries, 2017; Wang et al., 2020), thus requiring appropriately large training datasets and the use of known "future" observations to validate predictive accuracy in this context.
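The validation against known "future" observations described above can be illustrated with a minimal sketch. It is not the procedure used in this study: the function names, the cutoff year, and the synthetic samples are hypothetical, and a simple least-squares line stands in for the RFR and ANN models.

```python
import math

def temporal_split(samples, cutoff_year):
    """Split (year, predictor, dms) samples into a past (training) set
    and a withheld 'future' (validation) set."""
    past = [s for s in samples if s[0] <= cutoff_year]
    future = [s for s in samples if s[0] > cutoff_year]
    return past, future

def fit_line(pairs):
    """Least-squares fit y = a + b*x (a stand-in for any statistical model)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(observed, predicted):
    """Root mean squared error between observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Synthetic, purely illustrative samples: (year, predictor, DMS).
samples = [(2000 + i, float(i), 2.0 * i + 1.0) for i in range(10)]

# Train only on years up to the cutoff; score only on later years.
past, future = temporal_split(samples, cutoff_year=2006)
intercept, slope = fit_line([(x, y) for _, x, y in past])
predicted = [intercept + slope * x for _, x, _ in future]
future_rmse = rmse([y for _, _, y in future], predicted)
```

Because the withheld years never enter the fit, the resulting error reflects predictive skill rather than the model's ability to reproduce its own training data.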
The significant variability in DMS cycling across oceanic regimes will likely also render predictions more successful at regional, rather than global, scales (Galí et al., 2018; Royer et al., 2015). Nonetheless, prognostic applications of these algorithms should be investigated to aid in the future development of improved mechanistic models.

Conclusions
We have presented a statistical approach to modelling DMS distributions, which provides significantly higher accuracy than traditional methods (Simó and Dachs, 2002; Vallina and Simó, 2007; Galí et al., 2018; Watanabe et al., 2007; Lana et al., 2011), and yields summertime NESAP DMS sea-air flux estimates of 0.5-2.0 Tg S yr⁻¹, in agreement with previous findings (Herr et al., 2019; Lana et al., 2011). Our results further underscore the importance of the NESAP to global DMS production and motivate further observations in
2 are available via the SOLAS project (retrieved from www.bodc.ac.uk/solas_integration/implementation_products/group1/dms/), where the DMS sea-air fluxes were calculated as described in Sect. 2.3. The gridded climatologies produced from each algorithm in this study can be obtained at https://github.com/bjmcnabb/DMS_Climatology/tree/main/NESAP/Climatologies.
Author Contribution. BM and PT designed the study. Model code was written and implemented by BM. BM prepared the manuscript with significant contributions from PT.
Competing Interests.The authors declare that they have no conflict of interest.
https://doi.org/10.5194/bg-2021-189 Preprint. Discussion started: 13 August 2021. © Author(s) 2021. CC BY 4.0 License.
Machine learning algorithms require large datasets for the training and testing process. Traditionally, DMS measurements were based on time-consuming ship-board analysis of discrete samples, resulting in sparse data coverage over much of the oceans. More recently, the development of several automated DMS measurement systems

Fig. 1. Sensitivity of RFR and ANN models to grid size resolution. DMS fluxes (green) and R² values (red) derived from sensitivity tests of (a) RFR and (b) ANN models to pixel resolutions of 0.25-5°. The negative R² values observed at the lowest resolution (largest grid cells) indicate that the predicted values explain less variance than the overall mean of the dataset.
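The meaning of a negative R² can be made concrete with a short worked example. It assumes the standard coefficient of determination, R² = 1 − SS_res/SS_tot; the values below are invented for illustration and are not taken from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Invented values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]

# Predicting the dataset mean everywhere gives R2 = 0 by construction.
mean_prediction = [2.5, 2.5, 2.5, 2.5]
r2_mean = r_squared(observed, mean_prediction)

# Predictions that run against the observations do worse than the mean,
# so the residual sum of squares exceeds the total and R2 is negative.
poor_prediction = [4.0, 3.0, 2.0, 1.0]
r2_poor = r_squared(observed, poor_prediction)
```

Here the mean predictor returns R² = 0 and the inverted predictor returns R² = −3, so a negative value simply signals that the model is a worse predictor than the dataset mean.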
Fig. 2. Taylor Diagram showing comparative performance metrics of each individual Random Forest Regression (RFR) and Artificial Neural Network (ANN) model (1000-model ensembles) against multiple linear regression (MLR) and other statistical DMS models (see sections 2.1 and 2.4). The Pearson correlation coefficients ("Correlation"; outer radius), root mean squared error ("RMSE"; red radial contours), and standard deviations (SDs; grey radial contours from origin) are all computed with respect to the observed DMS samples after inverse hyperbolic sine (IHS) transformation. The reference of a perfect model fit is shown with a gold star. SDs of the model outputs are normalized to the SDs of the DMS observations. RMSE represents a normalized trigonometric derivation from both the correlation coefficients and normalized SDs. Performance metrics for the SD02, W07, VS07, and G18 algorithms reported here are calculated using regionally tuned coefficients to the NESAP derived from non-linear least-squares optimization (see section 2.4).
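The trigonometric link between the three statistics in a Taylor diagram can be sketched as follows. This assumes the standard construction, in which the plotted RMSE is the centred (mean-removed) RMS difference and obeys the law of cosines, E'² = 1 + σ̂² − 2σ̂R, where σ̂ is the normalized SD and R the correlation. The function name and sample values are hypothetical, and the study's own implementation may differ.

```python
import math

def taylor_statistics(observed, modelled):
    """Return correlation, normalized SD, and the normalized centred RMSE
    computed both directly and from the law of cosines."""
    # Inverse hyperbolic sine (IHS) transform, as named in the caption.
    obs = [math.asinh(v) for v in observed]
    mod = [math.asinh(v) for v in modelled]
    n = len(obs)
    mean_o, mean_m = sum(obs) / n, sum(mod) / n
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    sd_m = math.sqrt(sum((m - mean_m) ** 2 for m in mod) / n)
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(obs, mod)) / n
    corr = cov / (sd_o * sd_m)
    norm_sd = sd_m / sd_o
    # Direct calculation: RMS of the mean-removed differences, over sd_o.
    direct = math.sqrt(
        sum(((m - mean_m) - (o - mean_o)) ** 2 for o, m in zip(obs, mod)) / n
    ) / sd_o
    # Geometric calculation from the other two statistics.
    cosine = math.sqrt(1.0 + norm_sd ** 2 - 2.0 * norm_sd * corr)
    return corr, norm_sd, direct, cosine

# Invented DMS-like values (nM), for illustration only.
obs_dms = [1.2, 3.5, 8.0, 15.0, 2.2]
mod_dms = [1.0, 4.0, 6.5, 11.0, 3.0]
corr, norm_sd, rmse_direct, rmse_cosine = taylor_statistics(obs_dms, mod_dms)
```

The two RMSE values agree to numerical precision, which is why a single point on the diagram can encode all three statistics at once.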

Fig. 3. Performance of three modelling approaches in predicting observed DMS distributions: (A) multiple linear regression (MLR), (B) ensemble of Artificial Neural Networks (ANN), and (C) ensemble of Random Forest Regression (RFR). For consistency, all predictions are partitioned by the Training and Testing datasets used to build the ensembles (see section 2.2). Model performance (R²) is computed only for the Testing dataset predictions. The dashed line demonstrates a 1:1 relationship.
Fig. 4. Predicted maps of sea surface DMS concentrations and sea-air fluxes. (a) Ship-based observations of mean summertime (June-August) DMS concentrations used to construct the predictive models. (b) Differences between the (c) Random Forest Regression (RFR) and (d) Artificial Neural Network (ANN) ensemble predicted DMS concentrations. (e,f) DMS sea-air fluxes derived from the predicted DMS concentrations. Colormap ranges are restricted to illustrate trends, with <1% of DMS data exceeding the colorbar limits. The inset map in (b) shows the NESAP study region as a shaded green patch in a global orthographic projection.
Table 2. Monthly and mean summertime NESAP sea-air DMS fluxes. Fluxes (mean ± SD) are calculated from the Random Forest Regression (RFR) and Artificial Neural Network (ANN) model predictions (based on an ensemble of 2000 models). NESAP sea-air flux derived from the Lana et al. (2011) climatology is shown for comparative purposes.

Fig. 5. Histograms of DMS sea-air flux distributions derived from the 1000-model ensemble random forest regression (RFR) and artificial neural network (ANN) predictions as well as cruise observations (Obs.). The sample sizes of both models are equivalent (n = 49,632) and are substantially larger than the observational dataset (n = 2,063). Note that the ANN better predicts the upper tail of DMS observations greater than 20 nM.

Fig. 6. Principal Component Analysis (PCA) showing the relationships between variables used to construct the predictive algorithms. Eigenvectors (arrows) are superimposed over the principal components (PCs; data points) for the first two significant modes obtained from PCA. PCs are normalized and clustered by month (June-August, see legend for colors), while the eigenvectors are grouped by ensemble model predictions (gold) and nine predictor variables (black). The percentage of variance explained by each mode is indicated along the axes.
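How the eigenvectors and the percentage of variance explained by each mode arise can be shown with a deliberately reduced sketch. It uses only two variables, for which the covariance matrix has closed-form eigenvalues, whereas the analysis in this study involves many more. The function name and data are hypothetical.

```python
import math

def pca_two_variables(x, y):
    """PCA of two variables via the closed-form eigen-decomposition of
    their 2x2 sample covariance matrix."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x) / (n - 1)
    syy = sum((v - mean_y) ** 2 for v in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]], largest first.
    half_trace = (sxx + syy) / 2.0
    radius = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    eigenvalues = (half_trace + radius, half_trace - radius)
    # Orientation of the leading eigenvector (first principal axis).
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    leading_vector = (math.cos(angle), math.sin(angle))
    # Percentage of total variance carried by each mode.
    explained = tuple(100.0 * ev / (2.0 * half_trace) for ev in eigenvalues)
    return eigenvalues, leading_vector, explained

# Invented, strongly correlated variables, for illustration only.
var_a = [1.0, 2.0, 3.0, 4.0, 5.0]
var_b = [2.1, 3.9, 6.2, 7.8, 10.0]
eigenvalues, leading_vector, explained = pca_two_variables(var_a, var_b)
```

With strongly correlated inputs, nearly all variance falls on the first mode, and the leading eigenvector points along the shared direction of variation. This is the sense in which aligned arrows in a PCA biplot indicate covarying variables.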

Fig. 9. Predicted spatial and temporal (June-August) DMS distribution in relation to underlying oceanographic variables.DMS concentrations predicted from (a-c) the Random Forest Regression (RFR) and (d-f) the Artificial Neural Network (ANN) ensemble models are mapped alongside the monthly-averaged (g-i) sea surface height anomalies (SSHA), (j-l) wind speeds (Wind), and (m-o) sea surface nitrate (SSN) for each month.Colormap ranges are restricted to illustrate trends, with at most 1.5% of the data beyond the colorbar limits.
traditionally under-sampled areas such as the Alaska Gyre and Aleutian Islands. Although we are unable to directly examine the mechanistic drivers of DMS variability, our findings suggest nutrient limitation, light-driven processes, and eddy-induced mixing are potentially key drivers of DMS cycling in the NESAP. Future studies will benefit from using such statistical algorithms, in conjunction with field-based process studies and mechanistic models, to better understand the underlying dynamics and driving factors in the oceanic DMS cycle.
Code availability. The analysis in this study makes extensive use of the Numpy, Matplotlib, & Scikit-/github.com/bjmcnabb/DMS_Climatology/tree/main/NESAP or are available upon request from the corresponding author.
Data Availability. DMS observations and predictor datasets are described in the Methods with relevant links to repositories. Data from the Lana et al. (2011) climatology used for comparison in Table

Table 1. Data sources and spatial and temporal resolution of predictor variables used to develop the RFR and ANN algorithms. Data processing levels are indicated where relevant. All variables were used as predictors (excluding bathymetry) and post-processed to monthly-averaged, 0.25° resolution (see sections 2.1-2.2).
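The post-processing to monthly-averaged, 0.25° resolution can be sketched as a simple binning step. This is an illustration under stated assumptions, not the study's processing code: the function name, the tuple layout, and the sample values are hypothetical, and each grid cell is identified by flooring latitude and longitude to the cell size.

```python
import math
from collections import defaultdict

def monthly_grid_mean(samples, resolution=0.25):
    """Average (month, lat, lon, value) samples within each month and
    resolution-degree grid cell. Cells are keyed by integer indices."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for month, lat, lon, value in samples:
        key = (month, math.floor(lat / resolution), math.floor(lon / resolution))
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Invented samples: two June points in one cell, one July point in the same cell.
samples = [
    (6, 50.10, -140.05, 2.0),
    (6, 50.20, -140.10, 4.0),
    (7, 50.10, -140.05, 6.0),
]
gridded = monthly_grid_mean(samples)
june_cell = gridded[(6, 200, -561)]
july_cell = gridded[(7, 200, -561)]
```

The two June samples collapse to a single mean value, while the July sample stays separate. Averaging of this kind reduces the weight of densely sampled cruise tracks, at the cost of smoothing over transient features.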