IPB-MSA&SO4: a daily 0.25° resolution dataset of In-situ Produced Biogenic Methanesulfonic Acid and Sulfate over the North Atlantic during 1998–2022 based on machine learning

Accurate long-term marine-derived biogenic sulfur aerosol concentrations at high spatial and temporal resolutions are critical for a wide range of studies, including climatology, trend analysis, model evaluation, investigation of their contribution to the aerosol burden, elucidation of their radiative impacts, and provision of boundary conditions for regional models. By applying machine learning algorithms, we constructed the first publicly available daily gridded dataset of in-situ produced biogenic methanesulfonic acid (MSA) and sulfate (SO4) concentrations covering the North Atlantic Ocean. The dataset has a high spatial resolution of 0.25° × 0.25° and spans 25 years (1998–2022), far exceeding what observations alone could achieve in both space and time. The machine learning models were generated by combining in-situ observations of sulfur aerosol data at Mace Head research station, west coast of Ireland, and from NAAMES cruises in the

The IPB-MSA&SO4 data allowed us to analyze the spatiotemporal patterns of MSA, SO4, and the ratio between them (MSA:SO4). A comparison with the existing CAMS-EAC4 reanalysis suggests that our high-resolution dataset reproduces with high accuracy the spatial and temporal patterns of the biogenic sulfur aerosol concentration and has high consistency with independent measurements in the Atlantic Ocean. The IPB-MSA&SO4 dataset is publicly available at https://doi.org/10.17632/j8bzd5dvpx.1 (Mansour et al., 2023b).

Introduction
Marine-derived biogenic sulfur aerosol particles exert an important influence on the radiative properties of the atmosphere, both directly by scattering solar radiation and indirectly by modifying cloud properties (Langmann et al., 2008; Charlson et al., 1987). Dimethylsulfide (DMS), a volatile organic compound produced by marine phytoplankton, is the main precursor of biogenic sulfur-containing aerosols in the marine boundary layer (MBL). After being ventilated into the atmosphere, DMS is oxidized to form two of the major marine aerosol species, methanesulfonic acid (MSA) and non-sea-salt sulfate (nss-SO4²⁻).
Throughout the present study, we abbreviate the nss-SO4²⁻ concentration as SO4 and the MSA concentration as MSA, for simplicity. Sulfur emitted by marine organisms constitutes 20% (Fiddes et al., 2018) to 40% (Simo, 2001) of the total sulfur burden of the atmosphere. The role of MSA and SO4 concentrations in Earth's climate remains elusive (Mansour et al., 2020a; Hodshire et al., 2019). According to the CLAW hypothesis (Charlson et al., 1987), a negative climate feedback is expected to occur if phytoplankton respond to elevated temperature and solar radiation levels by increasing their DMS production, thereby exerting a cooling effect by increasing planetary albedo. Indeed, studies confirmed that DMS emissions contribute significantly to stabilizing the Earth's atmosphere (Sanchez et al., 2018; Thomas et al., 2010; Kim et al., 2018; Mahmood et al., 2019; Mansour et al., 2022; Mansour et al., 2020b), while a few others have claimed that the biological control over cloud condensation nuclei (CCN) goes even beyond the climatic feedback role of DMS proposed by CLAW (Quinn and Bates, 2011; Woodhouse et al., 2010; O'Dowd et al., 2004). As a result, biogenic sulfur aerosols play a central role in ocean-atmosphere interactions and regional climate change, and it is critical to parameterize and characterize biogenic MSA and SO4 across different sea areas to constrain the past, current and future climate impacts of both species (Hodshire et al., 2019; Gondwe et al., 2003).

2021). This level of uncertainty underlines the need for improved parameterizations of natural sulfur aerosol cycling and fluxes at regional scales (Hulswar et al., 2022; Gali et al., 2018; Mahajan et al., 2015), which is essential for determining their impact on climate.
Focusing on the North Atlantic (NA) Ocean, sulfur-containing aerosols, MSA and SO4, have been measured at the Mace Head sampling station, a coastal site in the eastern NA Ocean, to quantify the contribution of phytoplankton emissions to aerosol mass concentrations in the MBL (Rinaldi et al., 2010; Rinaldi et al., 2009; O'Dowd et al., 2004), to assess the long-term seasonal patterns in the chemical composition of submicron aerosol in marine air masses of different origin (Ovadnevaite et al., 2014), and to identify the oceanic regions acting as the main source of biogenic aerosols (Mansour et al., 2020b). During the NAAMES field campaigns, research cruises aimed at comprehending the relationships between ecosystems, aerosols, and clouds (Behrenfeld et al., 2019), Saliba et al. (2020) evaluated the origins and contributions of submicron organic and sulfate components to CCN concentrations in the MBL. They concluded that the DMS-derived secondary SO4 enhanced hygroscopicity, particle size, and CCN concentrations by 5-66%, especially in the spring, highlighting the importance of phytoplankton-produced DMS emissions for the CCN budget in the NA (Mansour et al., 2022; Mansour et al., 2020b; Sanchez et al., 2018). However, it is currently challenging to effectively investigate the climatology, long-term trends and climate forcing of biogenic sulfur compounds, as well as to validate model outputs, since there is a lack of high-time-resolution data on these compounds.
In this study, we present the first high-resolution and long-term daily gridded time series of freshly formed In-situ Produced Biogenic Methanesulfonic Acid and Sulfate (IPB-MSA&SO4) concentrations over the NA Ocean at 0.25° × 0.25° spatial resolution. The dataset covers 25 years from 1998 to 2022, with the possibility of future updating year by year. We created the IPB-MSA&SO4 dataset using in-situ MSA and SO4 data measured at the Mace Head (MHD) site and from NAAMES cruises, together with the gridded dataset from ECMWF-ERA5 and the constructed sea-to-air DMS flux (FDMS; Mansour et al., 2023a) as input data. To achieve this aim, we employed machine learning (ML) approaches: support vector machines (SVM), regression ensemble (RE), Gaussian process regression (GPR), and artificial neural networks (ANN). ML has been applied in a variety of scientific areas for model approximation, experiment design, and multivariate regression of complex oceanic and atmospheric systems; however, to our knowledge, no prior applications to MSA and SO4 prediction have been published. During model training, we evaluated the various possible kernel functions and hyperparameters in each ML type (details in Table S1), employing a 5-fold cross-validation strategy to select the best-performing (optimal) function capable of properly predicting MSA and SO4. Partial dependence analysis is also used to assess the effect of different predictors on the modeled MSA and SO4. Furthermore, we investigate the monthly spatial distributions of MSA, SO4 and the ratio between them (MSA:SO4) to examine the monthly evolution of MSA and SO4 in the different regions of the NA domain from 1998 to 2022. The output data (IPB-MSA&SO4) from this study should be useful for filling the data gap, particularly for the NA, and be applicable to a variety of investigations, such as climatology, trend analysis, model evaluation, radiative impacts, and providing boundary conditions for regional models.

Study area and measuring sites
The study area extends from 20° to 66° N and from 72° W to the prime meridian (Fig. 1), covering the NA Ocean. The key climate-relevant features in the study domain are the Atlantic meridional overturning circulation (AMOC) (Buckley and Marshall, 2016) and the cyclonic subpolar gyre (SPG) (Rhein et al., 2011). The AMOC is a major current system of the NA, transporting warm and salty surface waters toward the north and cold deep waters toward the south. The NA SPG extends from 45° N to around 65° N and comprises the sills between Greenland, Iceland, the Faroe Islands, and Scotland.
The SPG is a crucial region for the modulation of the temperate climate of north-western Europe (Marzocchi et al., 2015), and its dynamics determine the rate of deep and intermediate water formation (sinking of dense and cold surface waters through air-sea heat exchanges in wintertime), particularly in the Labrador Sea (Katsman et al., 2004). Both phenomena contribute to the regional changes in biological activity and subsequent emissions in the study domain.
The MHD Global Atmosphere Watch (GAW) research station (53.33° N, 9.90° W) is located on Ireland's west coast (Fig. 1), about 80 m from the coastline and 21 m above mean sea level. MHD is the only GAW station in the eastern Atlantic region and is the globally acknowledged clean background western European station, providing key baseline input for intercomparison with levels elsewhere in Europe (Grigas et al., 2017; O'Dowd et al., 2014).
Four shipboard field campaigns were carried out as part of the NAAMES research project (Behrenfeld et al., 2019). The tracks of cruises representing marine conditions during aerosol sampling (Saliba et al., 2020)

Observational data
The long-term atmospheric concentrations of submicron sulfur aerosol species (methanesulfonic acid [MSA] and sulfate [SO4]) measured at MHD from January 2009 to June 2018 were used. The measurements were performed using the Aerodyne High Resolution-Time of Flight-Aerosol Mass Spectrometer (HR-ToF-AMS). The HR-ToF-AMS (Decarlo et al., 2006) output has a time resolution of ~5-10 minutes, and the instrument was operated according to the recommendations of Jimenez et al. (2003), Allan et al. (2003) and Canagaratna et al. (2007). MSA was derived from the concentration of the mass fragment CH3SO2+ (Ovadnevaite et al., 2014). Further information on the MSA measurement can be found in Mansour et al. (2020a).
The black carbon (BC) concentrations were measured in situ at MHD by a multi-angle absorption photometer (O'Dowd et al., 2014) to identify the anthropogenically impacted air masses, as detailed in Section 3.1.1. (Saliba et al., 2020). We employ the SO4 concentrations (no high-resolution MSA datasets are available from the NAAMES campaigns) during periods dominated by marine aerosol sources, defined as periods when particle number concentrations were <1500 cm−3, BC was <50 ng m−3, 2-day back trajectories originated from the North or tropical Atlantic, and radon concentrations were <500 mBq m−3, according to Saliba et al. (2020). The SO4 measured by the AMS excludes refractory particles, which likely contain the majority of sea-salt sulfate, and is therefore approximately equivalent to nss-sulfate (Frossard et al., 2014).

Air mass back-trajectories
The Air Resources Laboratory (ARL) of the National Oceanic and Atmospheric Administration (NOAA) developed the Hybrid Single-Particle Lagrangian Integrated Trajectory (HYSPLIT4) model (Rolph et al., 2017; Stein et al., 2015), which is used to calculate the air mass back-trajectories (BTs). The archived Global Data Assimilation System (GDAS1) data (1° × 1°) of the National Centers for Environmental Prediction (NCEP) were used to drive the trajectory calculation (ftp://arlftp.arlhq.noaa.gov/pub/archives/gdas1). We ran the model with the MHD sampling station as a fixed source location and with the NAAMES cruise positions as a moving source location. The starting height is set to 100 m above ground level and the backward time is 3 days, with an interval of 1 h along each trajectory track. The schematic diagram of the BT calculation is shown in Fig. S1. The arrival frequency of BTs at MHD is 3 h (eight tracks a day), covering the period from 01-Jan-2009 to 30-Jun-2018, and that of NAAMES is hourly (twenty-four tracks a day), covering the time of the four campaigns identified as marine periods (Saliba et al., 2020).

Dimethylsulfide flux data
Seawater DMS is the primary contributor to biogenic sulfur aerosol in the atmosphere. For this reason, we use the sea-to-air DMS flux (FDMS) as a predictor of MSA and SO4 concentrations. Mansour et al. (2023a) used an ML predictive algorithm based on Gaussian process regression (GPR) to simulate the distribution of daily seawater DMS concentrations and the related FDMS in the NA area from 35° to 66° N and from 0° to 55° W at 0.25° × 0.25° spatial resolution. We extended the GPR model within the NA to encompass the NAAMES measurements, which are essential because they cover the westernmost section of the study area. Fig. S2 displays the main differences between the two domains. In short, the GPR was trained once more, following the same approach as Mansour et al. (2023a) but with a higher number of data points, yielding an improved R² of up to 0.77 on the independent test dataset. The daily sea-to-air FDMS was calculated using the gas transfer velocity (Goddijn-Murphy et al., 2012) and the DMS derived from GPR predictions. For more details about the data product, we refer the reader to Mansour et al. (2023a).

Meteorological data
The ECMWF-ERA5 reanalysis data (Hersbach et al., 2020) were downloaded to extract the meteorological parameters used as predictors of MSA and SO4 in the ML models. ERA5 provides estimates of the hourly state of the atmosphere, worldwide, with a spatial resolution of 0.25° × 0.25° at the surface and at different pressure levels. From the global domain, we extracted multiple atmospheric components, including air temperature at 2 m above sea level (AT) and surface net short-wave radiation flux (SRF) as representative of thermal heating, and the relative humidity (RH) as representative of water vapor abundance in the atmosphere. To represent the dispersion of aerosol particles in the troposphere and the wet removal through the below-cloud scavenging process, the boundary layer height (BLH) and the precipitation rate (PR) were utilized, respectively.

Data preparation
In this Section, we describe the preparation of predictors and responses that were used to train, cross-validate, and generate ML models.

Air mass selection
In previous studies (Mansour et al., 2020b; O'Dowd et al., 2015; Ovadnevaite et al., 2014), BC concentration was often used to select clean marine air masses, excluding inputs from continental emissions or ship trails. In this study, we still relied on BC measurements as a valuable tool to identify and exclude anthropogenically impacted air masses, but we also developed a more complete approach aimed at identifying air masses characterized by a high degree of contact with the ocean surface. This was necessary in order to select, from the in-situ observations, data points representing almost entirely oceanic sources and thus provide the best dataset for training the ML models.
The retention ratio of the air mass over the ocean ($R_O$) was calculated to determine whether an air mass (identified by its BT track) arriving at the MHD sampling station, or at the ship location in the case of shipborne measurements, was primarily from the NA region or not. We used 3-day BTs arriving 100 m above the MHD sampling station and the NAAMES tracks. The BT tracks were calculated 8 times per day at the MHD arrival point and 24 times per day at the NAAMES measuring points, considering only the measurements classified as marine periods (Saliba et al., 2020). $R_O$ has been calculated for each track as:
$$R_O = \frac{\sum_{i=1}^{N_O} e^{-t_i/72}}{\sum_{i=1}^{N_T} e^{-t_i/72}} \qquad (1)$$
where $N_T$ is the total number of trajectory endpoints, which is equal to 73 (arrival point + 72 backward hours), $N_O$ is the total number of trajectory endpoints passing over the ocean, and $t_i$ is the backward tracking time in hours, ranging from 0 to 72. Because air mass diffusion and particle deposition potentially occur during air mass transport, a weighting factor $e^{-t_i/72}$ related to tracking time has been introduced. The weighting factor ranges from 1 (at the arrival point) to 0.37 (farthest point); hence, the oceanic areas far from the arrival point, corresponding to longer backward tracking times, have a weaker influence than areas closer to the sampling point. As a result, a higher $R_O$ value implies that oceanic emissions have a greater influence on the air mass and that the source region is more likely to be the ocean. Other studies have used similar methods to characterize air mass source regions. For example, Zhou et al. (2021) studied the contribution of non-marine MSA sources in the coastal East China Sea and the Gulf of Aqaba by characterizing the land air masses. Rinaldi et al. (2021) used a combination of low-travelling air mass BTs and satellite ground-type maps to investigate the effect of ground conditions (sea ice, snow, seawater, and land) on air samples at Ny-Ålesund station in the Arctic Ocean.
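The time weighting of the ocean retention ratio can be illustrated with a minimal Python sketch. It assumes the ratio is the weighted share of hourly endpoints that lie over the ocean, with the exponential weight described above; the function name, the boolean land/ocean mask and the example track are hypothetical and not part of the original workflow.

```python
import math

def ocean_retention_ratio(over_ocean, tau_hours=72.0):
    """Time-weighted fraction of trajectory endpoints located over the ocean.

    over_ocean: one boolean per hourly endpoint, ordered from the arrival
    point (t = 0 h) backwards in time (t = 1, 2, ... h).
    """
    # Weight exp(-t/72): 1 at the arrival point, about 0.37 at t = 72 h.
    weights = [math.exp(-t / tau_hours) for t in range(len(over_ocean))]
    marine = sum(w for w, is_ocean in zip(weights, over_ocean) if is_ocean)
    return marine / sum(weights)

# Hypothetical 3-day track (73 endpoints): the most recent 60 h over the
# ocean, the oldest 13 h over land.
track = [True] * 60 + [False] * 13
r_o = ocean_retention_ratio(track)
```

Because the land endpoints in this example are the oldest ones, they carry the smallest weights, so the weighted ratio is higher than the plain count ratio 60/73.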
Because oceanic air masses crossing the NA can pass above the BLH, their connection to local sea surface processes such as marine biogenic emission and subsequent atmospheric reactions may be significantly weaker. To address this issue, Eq. 2 was used to calculate the retention ratio of an ocean air mass within the marine boundary layer ($R_{MBL}$):
$$R_{MBL} = \frac{N_{MBL}}{N_O} \qquad (2)$$
where $N_O$ is the total number of trajectory endpoints located over the ocean (i.e., marine endpoints) and $N_{MBL}$ is the number of marine endpoints that have an altitude below the BLH. The higher the $R_{MBL}$ value, the more the airflow over the ocean is confined to the MBL. The BLH values at each endpoint were extracted from the hourly ERA5 dataset.
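A minimal Python sketch of the boundary-layer retention ratio follows. It assumes that Eq. 2 is the unweighted fraction of marine endpoints whose altitude lies below the local BLH, as the definitions above suggest; the function name, argument layout and example values are hypothetical.

```python
def mbl_retention_ratio(over_ocean, altitude_m, blh_m):
    """Fraction of marine endpoints whose altitude is below the local BLH.

    All three sequences hold one value per trajectory endpoint.
    Assumption: an unweighted count ratio (no time weighting).
    """
    marine = [(z, h) for is_ocean, z, h in zip(over_ocean, altitude_m, blh_m)
              if is_ocean]
    if not marine:
        return 0.0  # no marine endpoints at all
    below = sum(1 for z, h in marine if z < h)
    return below / len(marine)

# Hypothetical track with four endpoints; the last one is over land.
r_mbl = mbl_retention_ratio(
    over_ocean=[True, True, True, False],
    altitude_m=[100.0, 400.0, 900.0, 50.0],
    blh_m=[500.0, 500.0, 800.0, 600.0],
)
```

Here two of the three marine endpoints are below the BLH, so the ratio is 2/3; the land endpoint does not enter the calculation.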
The total number of BT tracks arriving at MHD during the period from Jan-2009 to Jun-2018 is 27,744 (3468 days × 8 tracks per day). We counted the number of endpoints of all BTs in each 1° × 1° grid cell and normalized them to the maximum value to find the percentage of endpoints for all grid cells (Fig. S3). The highest density of BT endpoints is concentrated over the NA oceanic region, indicating that the main source regions for air masses transported to the MHD sampling station are most likely oceanic. At MHD, we investigated how MSA (a marine biogenic tracer) responds to changes in BC (a tracer of anthropogenic input), as seen in Fig. S4, by considering hourly data simultaneous to the arrival time of BTs (i.e., 8 times a day). We found that MSA tends to fluctuate minimally when BC is less than 15 ng m−3 (slope = 0.05), whereas MSA tends to rise slightly when BC exceeds 15 ng m−3 (slope = 0.28). Cases with hourly BC concentrations <15 ng m−3 were therefore classified as representative of marine conditions that are likely not influenced by anthropogenic sources.
To constrain the impact of marine biogenic emissions and meteorological parameters on MSA and SO4, air masses were included in this analysis only if the sum of their ocean and MBL retention ratios satisfied $R_O + R_{MBL} \geq 1.75$, meaning that the air mass had a high degree of contact with the ocean surface within the last 3 days (Fig. S4). Indeed, under this condition, an air mass must have $R_O$ of at least 0.75, and in that case the track must travel below the BLH 100% of the time. By introducing the criterion $R_O + R_{MBL} \geq 1.75$, approximately 72% of the BT tracks were retained. This reflects the significance of the MHD research station for studying NA biogenic emissions, and the frequency with which it is impacted by MBL air masses (Grigas et al., 2017; O'Dowd et al., 2014). After applying the BC threshold (<15 ng m−3) and conservatively removing all the observations made when BC data were unavailable (instrument downtime), 9211 tracks (33% of the total) were classified as representative of marine conditions (the frequency of the selected marine BTs is presented in Fig. S5).
Regarding the NAAMES measurements, the total number of calculated BT tracks was 832 (Fig. S6) during background marine conditions, as identified by Saliba et al. (2020). In this study, we kept 660 of these 832 tracks (Fig. S7) as representative samples of marine conditions during the NAAMES cruises by limiting the analysis to hourly samples with $R_O + R_{MBL} \geq 1.75$.
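The selection rule applied to the MHD tracks can be summarized in a short Python sketch. The thresholds (1.75 for the sum of the two retention ratios, 15 ng m−3 for BC) come from the text; the function name and the use of None for missing BC data are assumptions made for illustration.

```python
def is_clean_marine(r_ocean, r_mbl, bc_ng_m3, ratio_sum_min=1.75, bc_max=15.0):
    """Return True if a back-trajectory passes the marine selection criteria.

    r_ocean, r_mbl: retention ratios over the ocean and within the MBL (0 to 1).
    bc_ng_m3: hourly black carbon concentration, or None if unavailable.
    """
    if bc_ng_m3 is None:
        # Conservative choice: discard observations without BC data.
        return False
    return (r_ocean + r_mbl) >= ratio_sum_min and bc_ng_m3 < bc_max

# Hypothetical tracks: (ocean ratio, MBL ratio, BC in ng m-3).
tracks = [(0.90, 0.90, 10.0), (0.75, 1.00, 5.0), (0.80, 0.90, 10.0),
          (0.90, 0.90, None), (0.95, 0.95, 20.0)]
selected = [t for t in tracks if is_clean_marine(*t)]
```

Since each ratio is at most 1, a sum of at least 1.75 forces both ratios to be at least 0.75, which is the limiting case mentioned above.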

Predictors extraction along back trajectories
In order to train the ML models, it was necessary to associate each observed MSA and SO4 data point with the corresponding potential predictors. The potential predictors (FDMS, AT, SRF, RH, BLH and PR) were extracted at each endpoint of the BTs associated with each of the selected clean marine observational data points (see Section 3.1.1), inside the oceanic region within 20–66° N and 0–72° W (Fig. S1). The extracted predictor values were then averaged along each marine BT track, providing the most representative picture of the conditions (air mass history) that led to the formation of the observed sulfur aerosol concentrations. The few endpoints over land or above the BLH were eliminated. The Pearson's correlation coefficients between the potential predictors and the observed MSA and SO4 data were compared for BT lengths of 1, 2 and 3 days, to assess which BT length was most representative of the time scale of sulfur aerosol formation processes. As seen from Table 1, both MSA and SO4 correlate better with FDMS when a 3-day BT length is considered. Similarly, the majority of the other predictors, except for AT, tended to maximize their correlations at 2 or 3 days of BT length. Ultimately, for each predictor we adopted the BT length that maximized the correlation coefficient. Hourly SO4 at MHD and from the NAAMES campaigns, as well as MSA at MHD, measured concurrently with the selected marine BTs (Section 3.1.1), were used to build the ML models. A total of 6162 (6920) data points for MSA (SO4) were obtained. Further, we applied a filter with lower and upper thresholds at the 0.1 and 99.9 percentiles to remove the extremely low and high values that could bias the ML model training and cross-validation. This helped to identify and remove outliers in each dataset, thereby reducing the number of data points to 6150 (6905) for MSA (SO4) (∼0.2 % of data points were rejected).
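The percentile-based outlier filter can be sketched in a few lines of Python. The 0.1 and 99.9 percentile thresholds are taken from the text; the linear-interpolation percentile definition, the function names and the synthetic data are assumptions, since the exact percentile convention used by the authors is not stated.

```python
def percentile(sorted_vals, p):
    """Linearly interpolated percentile (0 <= p <= 100) of a sorted list."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def trim_extremes(values, lower_p=0.1, upper_p=99.9):
    """Keep only values between the lower and upper percentile thresholds."""
    s = sorted(values)
    low, high = percentile(s, lower_p), percentile(s, upper_p)
    return [v for v in values if low <= v <= high]

# Synthetic example: 10,000 values, of which about 0.2 % are rejected.
concentrations = [float(v) for v in range(1, 10001)]
kept = trim_extremes(concentrations)
```

With two symmetric 0.1 % tails, roughly 0.2 % of the points are removed, in line with the rejection rate reported above.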

Machine learning models
The methodological flowchart of the present study is shown in Fig. 2. The core of the framework is the use of supervised ML regression techniques to build predictive models for estimating the atmospheric concentrations of biogenic MSA and SO4 (responses) from independent variables (predictors). Predictors include the sea-to-air FDMS and the meteorological parameters that control the aerosol concentration in the MBL. Given that ML models may be generated even if there is no physical relationship between predictors and responses, we used multilinear regression to assess the contribution of each predictor to MSA and SO4 variations. Initially, we ran the multilinear regression model using all six potential predictors: FDMS, AT, SRF, RH, BLH and PR. Secondly, we re-ran the multilinear regression models, eliminating one predictor each time. Each independent variable's contribution to R² is the reduction in total R² when that variable is eliminated. The results (Table 3) showed that the six predictors can explain up to 74% (53%) of the MSA (SO4) variance. The predictors tend to contribute differently to MSA and SO4. SRF, FDMS and BLH are the most effective parameters for MSA (explaining up to 64 % of the variability), while SRF, AT and FDMS are the most influential on SO4 (explaining up to 44 % of the variability). RH makes a minor contribution to the MSA and SO4 variance. To determine whether a predictor contributes significantly to the explained variance, we performed an analysis of variance (ANOVA) on the implemented multilinear regression model. The ANOVA revealed that all the tested predictors make statistically significant (p < 0.05) contributions to MSA and SO4. For these reasons, we applied the ML models using all six potential predictors.
2). The model is then fit on the training set (4 folds) and evaluated on the validation set (last fold), and the average evaluation measures (accuracy) on the validation subsets of the five iterations are reported. To better examine the model's repeatability on a new independent dataset, the generated models were evaluated on test data that were not included in the model construction.
Four types of ML models were trained, cross-validated and evaluated to identify the best-performing model for estimating sulfur aerosol concentrations (MSA and SO4). The ML algorithms are SVM, RE, GPR, and ANN. These are among the most common types of algorithms, and each has subtypes in which advanced options and optimizations can increase performance and resilience. In general, each supervised ML model performs differently and has its own strengths and shortcomings. Finding the proper ML algorithm is largely based on trial and error; even experienced data scientists cannot anticipate whether an algorithm will work without testing it. Thus, understanding the fundamentals of various ML algorithms and their applicability in diverse applications is critical (Sarker et al., 2019). As a result, we initially assessed 17 algorithms belonging to the aforementioned four types and chose the best-fitting one from each type (Tables S1 and S2), as detailed in the following Sections.
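The drop-one-predictor procedure can be sketched as follows. This is a dependency-free Python illustration under stated assumptions: ordinary least squares with an intercept, solved through the normal equations; all function names and the two synthetic predictors are hypothetical and unrelated to the real data.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(columns, y):
    """R2 of a least-squares fit of y on the predictor columns plus an intercept."""
    n = len(y)
    design = [[1.0] * n] + [list(col) for col in columns]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[k] for b, col in zip(beta, design)) for k in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def drop_one_contributions(predictors, y):
    """Return the full-model R2 and the R2 lost when each predictor is removed."""
    full = r_squared(list(predictors.values()), y)
    loss = {}
    for name in predictors:
        reduced = [col for other, col in predictors.items() if other != name]
        loss[name] = full - r_squared(reduced, y)
    return full, loss

# Synthetic example: y depends strongly on x1 and weakly on x2.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x2 = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
y = [3.0 * a + 0.5 * b for a, b in zip(x1, x2)]
full_r2, contributions = drop_one_contributions({"x1": x1, "x2": x2}, y)
```

In practice a linear-algebra library would replace the hand-written solver. When predictors are correlated, the individual R² losses need not add up to the full-model R².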
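The fold rotation described above can be sketched in Python. The splitting scheme (shuffled indices dealt into five folds), the function names and the toy mean-predictor model are assumptions for illustration; the original work may have partitioned the data differently.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, validation

def cross_validate(fit, score, x, y, k=5, seed=0):
    """Average validation score over the k folds.

    fit(x_train, y_train) returns a model; score(model, x_val, y_val)
    returns a number such as the MAE.
    """
    scores = []
    for train, val in k_fold_indices(len(y), k, seed):
        model = fit([x[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [x[i] for i in val], [y[i] for i in val]))
    return sum(scores) / len(scores)

# Toy usage: a "model" that always predicts the training mean, scored by MAE.
fit_mean = lambda xs, ys: sum(ys) / len(ys)
mae_score = lambda m, xs, ys: sum(abs(v - m) for v in ys) / len(ys)
cv_mae = cross_validate(fit_mean, mae_score, x=list(range(20)), y=[2.0] * 20)
```

Each observation is used for validation exactly once and for training in the other four iterations; a separate held-out test set, as mentioned above, would be set aside before this loop.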

Support vector machines (SVM)
SVM is a powerful mathematical model based on statistical learning theory (Vapnik, 2013) that can be used for either classification or regression analysis. In recent decades, SVM has demonstrated high prediction accuracy in a wide range of regression problems in fields such as oceanography, meteorology, and atmospheric sciences (Lins et al., 2013; Sachindra et al., 2018; Shabani et al., 2020; Shrestha and Shukla, 2015; Fan et al., 2018). The SVM model estimates the regression using a series of kernel functions that are capable of implicitly converting the original, lower-dimensional input data to a higher-dimensional feature space. To achieve the best prediction accuracy for MSA and SO4, we assessed different SVM kernel functions, namely linear, polynomial (quadratic and cubic) and Gaussian (Tables S1 and S2). The Gaussian kernel was
https://doi.org/10.5194/essd-2023-352 Preprint. Discussion started: 6 December 2023. © Author(s) 2023. CC BY 4.0 License.

Regression ensemble (RE)
The ensemble is a technique that employs a collection of models (referred to as weak learners or base models), each of which is produced by applying a learning process to a specific problem, and then combines them to provide the final prediction (Mendes-Moreira et al., 2012). The performance and accuracy of ensembles are determined by the aggregation of weak learners (Hengl et al., 2018). The well-known types of aggregation are the bagging and boosting methods (Breiman, 2001).
In the bagging method (also known as bootstrap aggregating), the base models are generated using random sub-samples drawn from the original dataset with the bootstrap sampling method, where some original examples appear several times while others do not appear at all. On the other hand, the main idea of the boosting method is that it is possible to convert a base model that performs only slightly better than random guessing into one that achieves arbitrarily high accuracy. This conversion is performed by combining the estimations of several predictors. For more information on RE, the reader is referred to https://www.mathworks.com/help/stats/fitrensemble.html.

Gaussian process regression (GPR)
GPR is a non-parametric technique for solving nonlinear regression problems (Williams and Rasmussen, 1996) which is based on Bayesian theory and statistical learning theory. The accuracy of GPR is dependent on the adopted kernel (covariance) functions (Verrelst et al., 2016). We assessed different base kernel functions, namely exponential, Matérn 5/2, squared exponential, and rational quadratic (Asante-Okyere et al., 2018; Mansour et al., 2023a), to determine the optimal covariance function that could produce reliable predictions of MSA and SO4. For more information on GPR, the reader is referred to Mansour et al. (2023a) and https://www.mathworks.com/help/stats/fitrgp.html.
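For reference, the four covariance functions named above can be written in their standard textbook forms as functions of the distance r between two input points. The Python sketch below is illustrative only; the signal standard deviation sigma_f, the length scale and the rational-quadratic parameter alpha are hypothetical default values, not the hyperparameters fitted in this study.

```python
import math

def exponential(r, sigma_f=1.0, length=1.0):
    """Exponential kernel: sigma_f^2 * exp(-r / l)."""
    return sigma_f**2 * math.exp(-r / length)

def squared_exponential(r, sigma_f=1.0, length=1.0):
    """Squared exponential kernel: sigma_f^2 * exp(-r^2 / (2 l^2))."""
    return sigma_f**2 * math.exp(-0.5 * (r / length) ** 2)

def matern52(r, sigma_f=1.0, length=1.0):
    """Matern 5/2 kernel with s = sqrt(5) r / l."""
    s = math.sqrt(5.0) * r / length
    return sigma_f**2 * (1.0 + s + s * s / 3.0) * math.exp(-s)

def rational_quadratic(r, sigma_f=1.0, length=1.0, alpha=1.0):
    """Rational quadratic kernel: sigma_f^2 * (1 + r^2 / (2 alpha l^2))^(-alpha)."""
    return sigma_f**2 * (1.0 + r * r / (2.0 * alpha * length**2)) ** (-alpha)

# Covariance at increasing separation for each candidate kernel.
kernels = [exponential, squared_exponential, matern52, rational_quadratic]
table = {k.__name__: [k(r) for r in (0.0, 0.5, 1.0, 2.0)] for k in kernels}
```

All four kernels equal sigma_f² at zero separation and decay with distance; they differ in how smooth the resulting regression function is, which is why comparing them by cross-validation is a reasonable way to choose among them.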

Artificial neural networks (ANN)
ANN is an information processing system that can be used to understand the complex nonlinear relationship between the response and the predictors (Kalogirou, 2001). It consists of interconnected groups of artificial neurons that work in a manner analogous to biological neurons. The ANN structure comprises three distinct groups: the input (corresponding to the predictors), several hidden layers (fully connected), and the output (corresponding to the predicted response values). The input introduces data to the ANN model, the hidden layers process the data, and the results are produced in the output. Further details on ANN can be found at https://www.mathworks.com/help/stats/fitrnet.html. We trained various types of ANN, namely single-layer (number of fully connected layers = 1), bi-layered (number of fully connected layers = 2), and tri-layered (number of fully connected layers = 3) neural networks, as detailed in Tables S1 and S2.

Evaluation measures
In this study, we use different validation metrics to evaluate the ML models' performance. Each of the metrics is calculated using "residuals". Residuals are the differences between the observed data points $y_i$ and the predicted values $\hat{y}_i$, where $i = 1, 2, \ldots, n$ and $n$ is the number of observations. Models that predict the response better have residuals close to zero. The average magnitude of the residuals is called the mean absolute error (MAE):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad (3)$$
Regression models tend to use the square of the residuals instead of the absolute value. The square root of the average of the squared residuals is called the root mean square error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (4)$$
A low RMSE indicates that the model has relatively few large errors.
The metrics in Eqn. 3 and Eqn. 4 can only tell how a model compares to observations and/or other models. Neither can say objectively whether a model is a good fit for the data. Comparing a model to a simple baseline model is a different approach. This is the motivation behind the use of the coefficient of determination ($R^2$) metric (Eqn. 5). $R^2$ is the relative reduction in the total error obtained by fitting a model, and so takes a value between 0 and 1. If a model fits the data well, the model error is small and $R^2$ will be close to 1, and vice versa.
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (5)$$
where $\bar{y}$ is the average of the observations.
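The three evaluation measures described above (mean absolute error, root mean square error and coefficient of determination) can be computed directly from paired observations and predictions. The following Python sketch uses their standard definitions; the function names and the four-point example are made up for illustration.

```python
import math

def mae(obs, pred):
    """Mean absolute error: average magnitude of the residuals."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean square error: square root of the mean squared residual."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r2(obs, pred):
    """Coefficient of determination relative to a mean-only baseline."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

# Made-up observations and predictions.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]
scores = {"MAE": mae(obs, pred), "RMSE": rmse(obs, pred), "R2": r2(obs, pred)}
```

For this example the MAE is 0.25, the RMSE is about 0.354 and the coefficient of determination is 0.9. On data not used for fitting, the coefficient can fall below zero when a model performs worse than simply predicting the mean of the observations.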

Evaluation of ML model performance
As a first step, we assessed different possible hyperparameter optimizations in each of the four ML model types used (SVM, RE, GPR, and ANN) to determine which one has the best fit and the lowest errors in predicting sulfur aerosol (MSA and SO4) concentrations.
We chose the best model with the lowest errors in each type for further evaluation and analysis based on the evaluation measures (RMSE, MAE, and R²). The evaluation measures are summarized in Table S1 for MSA and Table S2 for  connected layer is selected. The four best-performing (optimal) models have been exported and saved so that they can be used to make new predictions on a new dataset. Importantly, the implemented ML models can reconstruct the characteristics of the MSA and SO4 daily time series with remarkable consistency between observed and predicted data. It is worth noting that the daily averages of MSA and SO4 have been calculated from the validation folds and the test set. The MAE of GPR is close to 0.014 (0.100) µg m−3 for MSA (SO4). The MAEs of EBT, SVM and ANN are higher than those of GPR for both species. According to R², the ranking order is the same as for MAE, i.e., GPR outperforms EBT, SVM and ANN for both MSA and SO4, although the differences in R² among the four models are small. An in-depth look at the MAE and R² from MHD and NAAMES (Fig. 4; right panels) demonstrates that the ML models perform well in predicting SO4 across different datasets. All four models show relatively high values of R² on the NAAMES dataset. EBT, SVM and ANN have similar R² values of about 0.81, whilst GPR has a higher R² of 0.87. In essence, the performance metrics indicate that GPR consistently has the highest accuracy and lowest errors, reflecting the robustness of GPR. Therefore, GPR was selected as the optimal regressor for further analysis throughout this study.

Partial dependence analysis
Most ML models are regarded as "black boxes", since the internal computations inside the multiple operational layers of a model are concealed and only the inputs and outputs are observable. The partial dependence analysis (Friedman, 2001) is used to assess how the predictors influence the output of an ML model and to show whether the relationship between the response and any of the features is linear, monotonic or more complex. The method entails altering one feature while constraining the remaining features to unaltered average values, to illustrate the marginal effect of the changed feature on the expected outcome. The partial dependence plots of MSA and SO4 as a function of the predictors in the highest-performing GPR model are shown in Fig. 5, indicating that the interactions between predictors and response are complex in general. MSA and SO4 levels tend to rise as FDMS rises from 3 to 10 µmol m⁻² d⁻¹. MSA continues to rise at stronger FDMS emission rates (>10 µmol m⁻² d⁻¹), whereas the SO4 concentration appears independent of FDMS above this threshold.
AT exhibits a positive relationship with MSA and SO4 concentrations in the range 5–10 °C and a downward trend above this range.
RH, which has the least impact on MSA and SO4 (Table 3), shows no clear pattern in the marginal changes of MSA and SO4.
MSA and SO4 present a negative dependence on PR, as rain is expected to scavenge aerosol particles; nevertheless, at higher levels of PR, SO4 concentrations tend to increase. This may be partly linked to enhanced cloudiness, associated with high PR, where the aqueous-phase formation of SO4 in the MBL may be favored (Zhu et al., 2006; Von Glasow and Crutzen, 2004). This is also in agreement with the enhancement of SO4 concentration at high RH. Finally, BLH and SRF are the most straightforward influencing parameters on MSA and SO4 levels, with deep BLH resulting in a dilution of their concentrations and high SRF leading to high MSA and SO4 levels, as expected for DMS photo-oxidation products.
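The partial dependence computation discussed in this section can be sketched as follows. This is a minimal illustration of the averaging form of partial dependence (Friedman, 2001), in which the model output is averaged over the observed values of the other features; it may differ in detail from the procedure used for Fig. 5. The toy `predict` function, the data rows and all names are hypothetical and do not represent the GPR model or the predictors of this study.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average model response when one feature is forced to each grid value.

    predict       : callable taking a list of feature values, returning a number
    rows          : list of feature vectors (the background data)
    feature_index : position of the feature being varied
    grid          : values at which the varied feature is evaluated
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)          # copy so the data are not altered
            modified[feature_index] = value
            total += predict(modified)
        curve.append(total / len(rows))   # mean response over the background data
    return curve


# Hypothetical toy model: linear in feature 0, quadratic in feature 1.
def toy_predict(x):
    return 2 * x[0] + x[1] ** 2


background = [[0, 1], [0, 3]]
pd_curve = partial_dependence(toy_predict, background, feature_index=0, grid=[0, 1, 2])
```

For a nonlinear model, averaging the predictions over the data generally gives a different curve from evaluating the model once with the other features fixed at their mean values; in the toy case above the two differ by a constant offset.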

The IPB-MSA&SO4 dataset
The GPR model was used to generate the long-term gridded fields of high-resolution (0.25° × 0.25°) MSA and SO4 concentrations. At each pixel, daily time series of MSA and SO4 have been generated, spanning from 1998 to 2022 (9131 days). The total number of pixels in the entire NA domain is 43,840, for a total of 400,303,040 data points. The daily time series of MSA and SO4 averaged over the entire NA domain are presented in Fig. S8. The dataset represents the sea-level concentrations of MSA and SO4 associated with in-situ production in the MBL, derived from the six selected predictors, which in turn represent the sea-to-air flux of DMS (the precursor) and the meteorological conditions that can most affect, in one direction or the other, the formation of the two products. For this reason, we consider the data to be representative of the concentration of sulfur aerosol species resulting, in each pixel, from the local biogenic emissions in combination with local atmospheric conditions. As such, we called the resulting data product the In-situ Produced Biogenic MSA and SO4 (IPB-MSA&SO4) dataset across the NA. It is important to note that atmospheric motion is not considered in our product and that the maps resulting from the data represent a static picture of potential sea-level concentrations of MSA and SO4, in a certain pixel and at a certain time, as a result only of the interplay between local DMS emissions, photochemistry and dilution/removal processes, and that provide accurate predictions of the actual sea level concentrations of MSA and SO4

Comparison with CAMS Reanalysis
To further examine the effectiveness of our GPR model, we compared the observed MSA concentrations at MHD with the most recently released CAMS-EAC4 (Inness et al., 2019)

Scatter plots and joint probability histograms of residual errors (Fig. 6) were constructed to compare the accuracy between GPR, CAMS and observations (referred to as OBS). It can be seen from the scatter plots (Fig. 6a and Fig. 6b) that the GPR-simulated MSA best matches the observations, with a fitted slope of 1.03, a correlation coefficient of 0.93, and most of the data points lying within the 95% confidence bounds. The joint probability histograms between observed MSA and the residuals (OBS − GPR) and (OBS − CAMS) are used to verify the variance of the residual errors around zero. The GPR histograms (Fig. 6c and Fig. 6e) show that the residual errors are mostly centered around zero (dashed black line on the right) up to the value of 0.1 µg m⁻³, where the majority of data points lie, while the CAMS residuals are skewed toward negative values, followed by positive residuals mainly at high MSA values (Fig. 6d and Fig. 6f). Quantitatively, the GPR has a relative MAE of 4.3%, compared with 6.3% for CAMS. In summary, GPR better captures the low concentrations of MSA, which CAMS tends to overestimate, while both CAMS and GPR show limitations in retrieving the extreme MSA concentrations. A quantitative statistical analysis (Fig. 6g) showed that no statistically significant (p < 0.05) difference exists between the seasonal median MSA from OBS and GPR, while CAMS presents a significant (p < 0.05) difference in all seasons except summer. Nevertheless, the two datasets (GPR and CAMS) properly retrieve the observed MSA seasonal cycle.
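The residual statistics used in this comparison can be illustrated with a short Python sketch. It assumes that the relative MAE is the MAE divided by the range of the observed values, as stated in the caption of Fig. 6, and that residuals are defined as observed minus predicted; the input values are made up for illustration and are not data from this study.

```python
def residual_summary(observed, predicted):
    """Return mean residual, MAE and relative MAE (in percent).

    Residuals are observed minus predicted; the relative MAE is the MAE
    divided by the range (max - min) of the observations.
    """
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_residual = sum(residuals) / len(residuals)
    mae = sum(abs(r) for r in residuals) / len(residuals)
    relative_mae = 100.0 * mae / (max(observed) - min(observed))
    return mean_residual, mae, relative_mae


# Illustrative concentrations in µg m-3 (hypothetical values).
obs = [0.0, 0.1, 0.2, 0.4]
model = [0.02, 0.08, 0.22, 0.38]

bias, mae, rel_mae = residual_summary(obs, model)
```

A mean residual close to zero indicates errors centered around zero, while a negative mean residual indicates that the product tends to overestimate the observations.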

Comparison with the Polarstern cruise results
In this Section, we present a case study exemplifying how the IPB-MSA&SO4 datasets can be used. Because the data product represents the concentration of freshly formed sulfur aerosol species and the ML model does not account for atmospheric transport, users must interpret the datasets considering the air mass history. To clarify this idea, we employed the independent MSA data measured during the Polarstern campaigns in the NA (Huang et al., 2017), which were not used in the training/validation or testing/evaluation of the ML models, and compared them with the MSA predicted by GPR.
In particular, the MSA predicted by GPR was extracted along the air mass BTs arriving at the hourly positions of the ship tracks and then averaged considering a 0-day (simultaneous), 1-day, 2-day and 3-day air mass history. The MSA measurements on Polarstern were performed during four scientific cruises, covering two spring seasons (April-May 2011 and April-May 2012) and two autumn seasons (October-November 2011 and October-November 2012). The ship tracks of the cruises from which the data were taken in the present study are shown in Fig. 7. The best match between GPR-simulated MSA and observed MSA occurred when 2-day air masses were considered: at 2-day air mass history, the slope reached 0.84 and the correlation coefficient 0.81 (Fig. 7a-d). Again, as seen in Fig. 7f, GPR MSA is considerably more consistent with the observations than CAMS, for which a significant difference from the observations (p < 0.05) is apparent.
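The averaging over air mass history can be sketched as follows. This is a minimal sketch under stated assumptions: each BT is represented as a list of hourly predicted MSA values ordered backward in time from the arrival point (index 0), and an N-day history averages the arrival value together with the preceding 24 × N hourly values. This data layout and the function name are hypothetical and are not taken from the original processing chain.

```python
def history_average(values_along_bt, days):
    """Mean predicted concentration over the last `days` days of a back-trajectory.

    values_along_bt : hourly predicted values, index 0 = arrival at the ship
                      position, index h = h hours before arrival
    days            : length of the air mass history (0 = arrival point only)
    """
    n_hours = 24 * days
    window = values_along_bt[: n_hours + 1]
    return sum(window) / len(window)


# Hypothetical 3-day trajectory whose predicted value increases steadily upwind.
trajectory = [float(hour) for hour in range(73)]

averages = {days: history_average(trajectory, days) for days in (0, 1, 2, 3)}
```

Each observation is then paired with the averages for the different history lengths, and the slope and correlation against the observations are compared across history lengths.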

Monthly MSA and SO4 distributions
In order to elucidate the geographical distributions of biogenic sulfur aerosol production across the NA domain, the IPB-MSA&SO4 datasets in the 25 years (1998–2022) were averaged to obtain the climatological monthly distributions of MSA and SO4 illustrated in Fig. 8a and Fig. 8b, respectively. The monthly climatological maps reveal that MSA and SO4 display a gradual southward increase in their concentrations, clearly evident from October to March, resulting in a large difference between the northern and southern parts of the domain. In contrast, during summer, the concentrations are more homogeneous over the domain (see latitudinal patterns in Fig. 9), still with a tendency to higher concentrations over the northeastern part. The seasonality of MSA and SO4 is evident: the increase for both compounds starts in April and peaks in June-July, followed by a gradual decrease in September (Fig. S8). The lowest MSA (SO4) concentration occurs in December at 0.006 ± 0.005 (0.155 ± 0.079) µg m⁻³ and the highest occurs in June at 0.029 ± 0.013 (0.364 ± 0.075) µg m⁻³ (Fig. 8a and Fig. 8b), consistent with the fact that winter and summer are typically the seasons of lowest and highest biological activity, respectively, in the NA (Mansour et al., 2023a).
The ratio of MSA to SO4 (MSA:SO4) also exhibits a seasonal pattern, with the lowest (highest) values observed during the winter (summer), as presented in Fig. 8c. July has the highest spatial average of the ratio, 0.077 ± 0.022, while the lowest, 0.032 ± 0.012, occurs in December. Looking at the overall distributions, MSA:SO4 demonstrates a general southward increase, with the exception of the summer months. In summer (mainly July and August), MSA:SO4 above 50°N has an opposite trend with respect to the one below 50°N. In detail, from North to South, we report a sharp increase in MSA:SO4, maximized around 50°N, followed by an abrupt decrease toward the equator. A possible explanation for the decline in MSA:SO4 below 50°N is the increase in AT associated with the warmer air toward the equator, in line with observations in the Pacific Ocean (Bates et al., 1992) and with the higher ratio observed in colder air masses (marine Polar and Arctic) with respect to warmer ones (marine Tropical) at MHD (Ovadnevaite et al., 2014). As a final remark, we report that the summertime low MSA:SO4 below 50°N is linked to a decrease in FDMS in the same latitudinal zone (Mansour et al., 2023a). Owing to the low DMS emissions, the different DMS oxidation pathways may be in competition (Barone et al., 1995); since MSA is formed preferentially through the OH-addition pathway at low temperatures (Shen et al., 2022), the production of MSA may be decreased relative to that of SO4 in the warm southern part of the domain during summer, leading to the observed decrease in the MSA:SO4 ratio.
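The construction of a monthly climatology from daily values can be sketched as below for a single pixel. This is a minimal sketch with hypothetical records; in particular, computing the monthly MSA:SO4 value as the mean of the daily ratios is an assumption of this example, and the ratio of the monthly means would be an equally plausible choice that generally gives a different number.

```python
from collections import defaultdict
from datetime import date


def monthly_climatology(records):
    """Average daily (date, MSA, SO4) records by calendar month.

    Returns {month: {"msa": mean, "so4": mean, "ratio": mean of daily MSA/SO4}}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for day, msa, so4 in records:
        acc = sums[day.month]
        acc[0] += msa
        acc[1] += so4
        acc[2] += msa / so4   # daily ratio (assumption: mean of ratios)
        acc[3] += 1
    return {
        month: {"msa": a[0] / a[3], "so4": a[1] / a[3], "ratio": a[2] / a[3]}
        for month, a in sorted(sums.items())
    }


# Hypothetical daily values in µg m-3 for one pixel (not from the dataset).
daily = [
    (date(1998, 6, 1), 0.03, 0.3),
    (date(1999, 6, 1), 0.05, 0.5),
    (date(1998, 12, 1), 0.006, 0.15),
]
climatology = monthly_climatology(daily)
```

Applying the same grouping to every pixel, and then averaging over the domain, yields maps and spatial means of the kind discussed in this section.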

Data availability
The dataset includes daily MSA and SO4 concentrations at 0.25° × 0.25° spatial resolution over the North Atlantic Ocean from January 1998 to December 2022. The datasets are publicly available in NetCDF format as daily files on the Mendeley online repository at https://doi.org/10.17632/j8bzd5dvpx.1 (Mansour et al., 2023b).

Conclusions
Marine aerosol data can be obtained from in-situ coastal observatories or from shipborne measurements. However, point observations at the coast are limited in spatial representativeness, while shipborne measurements are limited in temporal coverage. Understanding the dynamics of marine-derived biogenic sulfur aerosols and their radiative effects, as well as carrying out relevant scientific studies, requires long-term, continuous and high-resolution (in both space and time) datasets. To overcome the limitations of point measurements, we combined the in-situ observations of sulfur aerosol at Mace Head and from the NAAMES cruises, as dependent variables, with the sea-to-air DMS flux and ECMWF-ERA5 reanalysis meteorological datasets, as independent variables, to investigate the potential of machine learning techniques for the prediction of daily MSA and SO4 sea-level concentrations over the North Atlantic Ocean. We evaluated four machine learning models (i.e., SVM, RE, GPR, and ANN), considering various sets of hyperparameter optimizations.
Our findings demonstrated that the GPR model outperforms the other approaches in simulating the concentrations of biogenic sulfur aerosols, capturing up to 86% and 72% of the observed variance in daily MSA and SO4, respectively. This makes the GPR an effective tool for obtaining trustworthy sea-level MSA and SO4 concentrations over the North Atlantic, and the approach may also be successful in other oceanic regions or over the entire global ocean. The impact of the six independent predictors on the simulated MSA and SO4 was further evaluated using the GPR partial dependence analysis, which reveals that the relationships between them are multifaceted rather than linear or monotonic.
Using the GPR machine learning method, we constructed a novel daily gridded dataset of in-situ produced biogenic MSA and SO4 concentrations at 0.25° × 0.25° resolution (named IPB-MSA&SO4), covering the North Atlantic Ocean from 1998 to 2022. The dataset represents the sea-level concentrations of MSA and SO4 associated with in-situ production in the MBL, i.e., the concentration of sulfur aerosol species resulting, in each pixel, from the local biogenic emissions in combination with local atmospheric conditions. Other inputs, such as terrestrial emissions or sinking of sulfur species produced in the free troposphere, are not accounted for in the present dataset.
Comparison of the GPR-derived MSA with the existing CAMS-EAC4 reanalysis product reveals that our high-resolution dataset accurately reproduces the spatial and temporal patterns of the biogenic sulfur aerosol concentration and has high consistency with the independent observations from the Polarstern cruises in the Atlantic. The IPB-MSA&SO4 data were used to analyze the spatiotemporal variations of MSA, SO4, and the ratio between them (MSA:SO4). It was found that the monthly concentrations of MSA and SO4 across the NA are characterized by a significant southward increase in each month, with the exception of summertime, when MSA and SO4 display more homogeneous spatial patterns with a tendency to higher concentrations over the northeastern part of the domain. The MSA:SO4 ratio exhibits a seasonal variation from winter (low) to summer (high), characterized by a sharp decline from the 50°N parallel toward the equator, mainly in July-August. More in-depth analyses can be conducted based on the biogenic sulfur aerosol concentration datasets, which could help further the understanding of oceanic sulfur-aerosol-cloud interactions.

Figure 1: The study region of the North Atlantic Ocean (72°–0° W, 20°–66° N) with bathymetry presented in meters. The gridded bathymetric dataset was extracted from the General Bathymetric Chart of the Oceans (https://www.gebco.net), the GEBCO_2023 Grid. The blue-filled square represents the Mace Head measuring station on the west coast of Ireland and the red points are the sampling points that represent marine conditions along the NAAMES cruise tracks. The violet points represent the ship track during the Polarstern campaigns.
https://doi.org/10.5194/essd-2023-352 Preprint. Discussion started: 6 December 2023 © Author(s) 2023. CC BY 4.0 License.
Details of the MSA and SO4 percentile thresholds, along with the amount of data before and after applying the filters, are given in Table 2. The hourly data after cleanup are used for training/cross-validation and testing of the ML models.

Figure 2: The methodology's workflow. Predictors and response variables data preparation, the overall framework of generation
The datasets, containing the corresponding predictors and each one of the responses (MSA and SO4) separately, were split randomly into two subsets, defined as the training/cross-validation set and the test/evaluation set for each response. The training/cross-validation sets include 80% of the total points (n = 4920 for MSA and n = 5524 for SO4), while the test/evaluation sets comprise the remaining 20% (n = 1230 for MSA and n = 1381 for SO4). To improve the accuracy of the ML algorithms and protect against overfitting, a k-fold cross-validation strategy with k = 5 was used, as this has been shown to provide maximal model prediction robustness and minimal bias (Rodriguez et al., 2010; Fushiki, 2011). The k-fold cross-validation is a procedure used to estimate the skill of the model on new data and generally results in a less biased estimate of the model skill. The number k refers to how many groups a given data sample is to be split into. In this study, where k = 5, the training/cross-validation dataset was further divided randomly into 5 folds of roughly equal size. At each trial, one group is designated as a holdout or validation dataset, while the remaining four groups are designated as training data (Fig.
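The splitting strategy described above can be sketched in Python as follows. This is a minimal sketch using only the standard library; the random seed, the function names and the way folds are assigned are assumptions of this example, and the original analysis may have used a different implementation.

```python
import random


def split_and_fold(n_samples, test_fraction=0.2, k=5, seed=0):
    """Randomly split sample indices into a test set and k cross-validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible random order
    n_test = round(n_samples * test_fraction)
    test = indices[:n_test]                    # held out for final evaluation
    train_cv = indices[n_test:]                # used for training/cross-validation
    folds = [train_cv[i::k] for i in range(k)]  # k folds of roughly equal size
    return train_cv, test, folds


def cv_splits(folds):
    """Yield (training, holdout) index lists, one pair per fold."""
    for i, holdout in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, holdout


# Sample counts quoted in the text: 6150 points for MSA, 6905 for SO4.
msa_train_cv, msa_test, msa_folds = split_and_fold(4920 + 1230)
so4_train_cv, so4_test, so4_folds = split_and_fold(5524 + 1381)
```

In each of the five trials the model is trained on four folds and validated on the remaining one; the test set is never used during training or validation.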
SO4. The medium Gaussian SVM, which utilizes a Gaussian kernel scale equal to the square root of the number of predictors (= 2.4), displayed better performance. The ensemble bagged trees (EBT) of a bootstrap aggregated ensemble and the GPR, which employs the rational quadratic kernel, yield the minimum errors. Finally, a medium ANN of layer size 25 with one fully

Fig. 3a-d and Fig. 4a-d present the detailed comparison between observed and predicted MSA and SO4, respectively, for the four optimal ML models developed. When compared to the multilinear regression (Table 3), it is clear that ML models, in

Figure 3: Comparison of predicted and observed MSA on the hourly (left panels) and daily (right panels) scales: (a) GPR, (b) EBT, (c) SVM, and (d) ANN. The validation and test data subsets are used to compute the model's performance. R² and RMSE are computed in logarithmic space, whereas MAE is computed on a normal scale.

Figure 4: Comparison of predicted and observed SO4 on the hourly (left panels) and daily (right panels) scales: (a) GPR, (b) EBT,
once averaged over 2-to-3-day transport tracts. Accordingly, the IPB-MSA&SO4 data presented hereafter are different from the output of a chemical transport model. Nevertheless, we believe that this unprecedented dataset may be useful for many research purposes, for instance, investigating long-term trends or addressing the interannual or spatial variability in the production of biogenic sulfur aerosol species. Examples of the scientific information that can be extracted from the data, and of how they can be compared to model output or in-situ observations, are provided in the next Sections.

Figure 5: Partial dependence plots of MSA and SO4 as a function of the predictors revealed by the GPR model.
reanalysis datasets. The EAC4 (ECMWF Atmospheric Composition Reanalysis 4) is the fourth generation of the ECMWF global reanalysis dataset of atmospheric composition from the Copernicus Atmosphere Monitoring Service (CAMS). CAMS-EAC4 is a collection of atmospheric composition fields from 2003 to the present, including aerosols and chemical species, for which MSA data are available. The CAMS datasets have a spatial resolution of about 0.75° × 0.75° and a temporal resolution of 3 h. Our datasets have a resolution of 0.25° × 0.25° and start from 1998. To compare the two products, we extracted MSA data from CAMS locally, at the grid cell in front of the MHD station, corresponding to maritime BT timings, and averaged them to daily resolution. Conservatively, the MSA concentration data simulated by GPR were taken from the validation and test sets, which were not included in the model training. These MSA concentrations at MHD were predicted by taking the predictors along the BTs into consideration, to account for the air motion (see Section 3.1.2 for details).

Figure 6: Comparison between observed MSA at the MHD measuring site and both MSA predicted by GPR (a) and MSA extracted from the CAMS reanalysis (b). (c) and (d): joint probability histograms between observed MSA and residual errors (observed − predicted); the black dashed lines represent the change of MSA residual errors in each bin. MAE is the mean absolute error, and the relative MAE has been calculated as the MAE divided by the range of observed MSA. (e) and (f): frequency distributions of the residual errors. (g): Seasonal box charts from different datasets. Each box chart displays the median (line inside each box), the 1st and 3rd quartiles (bottom and top edges of each box), the minimum and maximum values that are not outliers (whiskers), and any outliers represented by '+' (computed as values that are more than 1.5 times the interquartile range away from the top or bottom of the box). Box charts whose notches (the shaded region around each median) do not overlap have different medians at the 95% confidence level.

Figure 7: Scatter plots between observed MSA during the Polarstern campaigns (Huang et al., 2017) and MSA predicted by GPR, considering (a) 0-day, (b) 1-day, (c) 2-day and (d) 3-day air mass history. (f): Seasonal box charts from different datasets. The features displayed on each box chart are the same as those given in Fig. 6.

(Table 3), it is clear that ML models, in general, can reconstruct the observations with a markedly higher R² value, which means that the selected ML approaches capture much more of the observed MSA and SO4 variability. While the four applied optimal algorithms have quasi-similar measures, the best model for predicting MSA and SO4 is GPR. For hourly MSA (SO4), the GPR achieves the highest R²