Linking intrinsic and apparent relationships between phytoplankton and environmental forcings using machine learning – What are the challenges?

Controls on phytoplankton growth are typically determined in two ways: by varying one driver of growth 10 at a time such as nutrient or light in a controlled laboratory setting (intrinsic relationships) or by observing the emergence of relationships in the environment (apparent relationships). However, challenges remain when trying to take the intrinsic relationships found in a lab and scaling them up to the size of ecosystems (i.e., linking intrinsic relationships in the lab to apparent relationships in large ecosystems). We investigated whether machine learning (ML) techniques could help bridge this gap. ML methods have many benefits, including the ability to accurately 15 predict outcomes in complex systems without prior knowledge. Although previous studies have found that ML can find apparent relationships, there has yet to be a systematic study that has examined when and why these apparent relationships will diverge from the underlying intrinsic relationships. To investigate this question, we created three scenarios: one where the intrinsic and apparent relationships operate on the same time and spatial scale, another model where the intrinsic and apparent relationships have different timescales but the same spatial scale, and finally 20 one in which we apply ML to actual ESM output. Our results demonstrated that when intrinsic and apparent relationships are closely related and operate on the same spatial and temporal timescale, ML is able to extract the intrinsic relationships when only provided information about the apparent relationships. However, when the intrinsic and apparent relationships operated on different timescales (as little separation as hourly to daily), the ML methods underestimated the biomass in the intrinsic relationships. This was largely attributable to the decline in the variation 25 of the measurements; the hourly time series had higher variability than the daily, weekly, and monthly-averaged time series. Although the limitations found by ML were overestimated, they were able to produce more realistic shapes of the actual relationships compared to MLR. Future research may use this type of information to investigate which nutrients affect the biomass most when values of the other nutrients change. From our study, it appears that ML can extract useful information from ESM output and could likely do so for observational datasets as well. 30

Linking intrinsic and apparent relationships between phytoplankton and environmental forcings using machine learning -What are the challenges? Christopher Holder 1  Abstract. Controls on phytoplankton growth are typically determined in two ways: by varying one driver of growth 10 at a time such as nutrient or light in a controlled laboratory setting (intrinsic relationships) or by observing the emergence of relationships in the environment (apparent relationships). However, challenges remain when trying to take the intrinsic relationships found in a lab and scaling them up to the size of ecosystems (i.e., linking intrinsic relationships in the lab to apparent relationships in large ecosystems). We investigated whether machine learning (ML) techniques could help bridge this gap. ML methods have many benefits, including the ability to accurately 15 predict outcomes in complex systems without prior knowledge. Although previous studies have found that ML can find apparent relationships, there has yet to be a systematic study that has examined when and why these apparent relationships will diverge from the underlying intrinsic relationships. To investigate this question, we created three scenarios: one where the intrinsic and apparent relationships operate on the same time and spatial scale, another model where the intrinsic and apparent relationships have different timescales but the same spatial scale, and finally 20 one in which we apply ML to actual ESM output. Our results demonstrated that when intrinsic and apparent relationships are closely related and operate on the same spatial and temporal timescale, ML is able to extract the intrinsic relationships when only provided information about the apparent relationships. However, when the intrinsic and apparent relationships operated on different timescales (as little separation as hourly to daily), the ML methods underestimated the biomass in the intrinsic relationships. This was largely attributable to the decline in the variation 25 of the measurements; the hourly time series had higher variability than the daily, weekly, and monthly-averaged time series. Although the limitations found by ML were overestimated, they were able to produce more realistic shapes of the actual relationships compared to MLR. Future research may use this type of information to investigate which nutrients affect the biomass most when values of the other nutrients change. From our study, it appears that ML can extract useful information from ESM output and could likely do so for observational datasets as well. 30 annual cycle of sea ice. These differences would then be expected to be reflected in differences in biogeochemical 70 cycling, independent of differences in the biological models. These studies highlight the importance of constraining not just individual biogeochemical fields, but also their relationships with each other. What is less clear is: 1. Can robust relationships be found? 2. If so, what methods are most skillful in finding them? 3. How do you interpret the apparent relationships that emerge when they diverge from the intrinsic relationships we expect?

75
Recently, researchers have turned to machine learning (ML) to help in uncovering the dynamics of ESMs. ML is capable of fitting a model to a dataset without any prior knowledge of the system and without any of the biases that may come from researchers about what processes are most important. As applied to ESMs, ML has mostly been used to constrain physics parameterizations, such as longwave radiation (Belochitski et al., 2011;Chevallier et al., 1998) and atmospheric convection (Brenowitz and Bretherton, 2018;Gentine et al., 2018;Krasnopolsky et al., 2010Krasnopolsky et al., , 80 2013O'Gorman and Dwyer, 2018;Rasp et al., 2018).
With regards to phytoplankton, ML has not been explicitly applied within ESMs but has been used on phytoplankton observations (Bourel et al., 2017;Flombaum et al., 2020;Kruk and Segura, 2012;Mattei et al., 2018;Olden, 2000;Rivero-Calle et al., 2015;Scardi, 1996Scardi, , 2001Scardi and Harding, 1999) and has used ESM output as 85 input for an ML model trained on phytoplankton observations (Flombaum et al., 2020). Rivero-Calle et al. (2015) used random forest (RF) to identify the drivers of coccolithophore abundance in the North Atlantic through feature importance measures and partial dependence plots. The authors were able to find an apparent relationship between coccolithophore abundance and environmental levels of CO2, which was consistent with intrinsic relationships between coccolithophore growth rates and ambient CO2 reported from 41 laboratory studies. They also found 90 consistency between the apparent and intrinsic relationships between coccolithophores and temperature. While they were able to find links between particular apparent relationships found with the RFs and intrinsic relationships between laboratory studies, it remains unclear when and why this link breaks.
ML has been used to examine apparent relationships of phytoplankton in the environment (Flombaum et al., 2020;95 Rivero-Calle et al., 2015;Scardi, 1996Scardi, , 2001 and it is reasonable to assume that ML could find intrinsic relationships when provided a new independent dataset from laboratory growth experiments. However, it has yet to be determined under what circumstances the apparent relationships captured by ML are no longer equal to the intrinsic relationships that actually control phytoplankton growth. In this paper, we identify two drivers of such divergence. The first is colimitation that limits the biological responses actually found in the ocean, which causes 100 non-parametric ML methods to produce apparently non-physical results. The second is climatological averaging of the input and output variables, which can distort these relationships in the presence of non-linearity. https://doi.org/10.5194/bg-2020-262 Preprint. Discussion started: 22 July 2020 c Author(s) 2020. CC BY 4.0 License.
To investigate when and why the link between intrinsic and apparent relationships break, we applied ML methods to three scenarios. For the first, we constructed a simple model in which the intrinsic and apparent relationships 105 operated on the same time and spatial scale and were only separated by a scaling factor, but in which the environmental drivers had realistic inter-relationships. In the second, we modified the first scenario to allow the intrinsic and apparent relationships to operate on different timescales -allowing us to evaluate the impact of timeaveraging on the retrieval of intrinsic relationships. In the third, we took the output from an established biogeochemical model in which the biomass is a non-linear function of growth rate to demonstrate the potential 110 information that can be extracted from ESM output using ML.

Scenario 1: Intrinsic and apparent relationships on the same timescale
In the first scenario, we wanted to determine how well different ML methods could extract intrinsic relationships 115 when only provided information on the apparent relationships and when the intrinsic and apparent relationships were operating on the same timescale. In this scenario, the apparent relationships were simply the result of multiplying the intrinsic relationships between predictors and biomass by a scaling constant.
We designed a simple phytoplankton system in which biomass was a function of micronutrient, macronutrient, and 120 light limitations based on realistic inter-relationships between limitations (Eq. 1): where B is the value for biomass (mol kg -1 ), S * is a scaling factor, and Lmicro,macro,irr are the limitation terms for micronutrient (micro), dissolved macronutrient (macro), and light (irradiance; irr), respectively. The scaling factor (1.9x10 -6 mol kg -1 ) was used, so the resulting biomass calculation was in units of mol kg -1 . While simplistic, this is 125 actually the steady-state solution of a simple phytoplankton-zooplankton system when grazing scales as the product of phytoplankton and zooplankton concentrations and zooplankton mortality is quadratic in the zooplankton concentration.
Each of the limitation terms (L in Eq. 1) were functions of Michaelis-Menten growth curves (Eq. 2): 130 where LN is the limitation term for the respective factor, N is the concentration of the nutrient/intensity of the light, and KN is the half-saturation constant specific to each factor. In terms of our nomenclature, Eq. 1 defines the apparent relationship between nutrients, light, and biomass such as might be found in the environment, while Eq. 2 https://doi.org/10.5194/bg-2020-262 Preprint. Discussion started: 22 July 2020 c Author(s) 2020. CC BY 4.0 License. is the intrinsic relationship between nutrient and growth rate such as might be found in the laboratory or coded in an 135 ESM.
For the concentrations of each factor (N in Eq. 2), we took the monthly-averaged value for every lat/lon pair (i.e., 12 monthly values for each lat/lon pair) from the Earth System Model ESM2Mc (Galbraith et al., 2011). ESM2Mc is a fully coupled atmosphere, ocean, sea ice model into which is embedded in an ocean biogeochemical cycling module. 140 Known as BLING (Biogeochemistry with Light, Iron, Nutrients, and Gases; Galbraith et al., 2010), this module carries a macronutrient, a micronutrient, and light as predictive variables and uses them to predict biomass using a highly parameterized ecosystem (described in more detail below). The half-saturation coefficients (KN in Eq. 2) for the macronutrient and micronutrient were also borrowed from BLING with values of 1x10 -7 mol kg -1 and 2x10 -10 mol kg -1 , respectively. The half-saturation coefficient for light was set at 34.3 W m -2 , which was the global mean for 145 the light limitation factor in the ESM2Mc simulation used later in this paper.
The final dataset consisted of three input/predictor variables and one response term with a total of 77,328 "observations." The input variables given to each of three ML methods (Multiple Linear Regression, Random Forests, and Neural Network Ensembles, described in more detail below) were the concentrations (not the limitation 150 terms) for the micronutrient, macronutrient, and light. The response variable was the biomass we calculated from Eq. 1 and 2.
The dataset was then randomly split into training and testing subsets, with 60% of the observations going to the training subset and the remainder going to the testing subset. This provided a convenient way to test the 155 generalizability of each ML method by presenting them with "new" observations from the test subset and ensuring the models did not overfit the data. The input and output values for the training subset were then used to train a model for each ML method. Once each method was trained, we provided the trained models with the input values of the testing subset to acquire their respective predictions. These predictions were then compared to the actual output values of the test subset. To assess model performance, we calculated the coefficient of determination (R 2 ), the mean 160 squared error (MSE), and the root mean squared error (RMSE) between the ML predictions and the actual output values for the training and testing subsets.
Following this, a sensitivity analysis was performed. We allowed one predictor to vary across its min-max range while holding the other two input variables at their 25 th , 50 th (median), and 75 th percentile values. This was repeated 165 for each predictor. This allowed us to isolate the impact of each predictor on the biomass -creating "cross-sections" of the dataset where only one variable changes. For comparison, these values were also run through Eq. 1 and 2 to https://doi.org/10.5194/bg-2020-262 Preprint. Discussion started: 22 July 2020 c Author(s) 2020. CC BY 4.0 License. calculate the "true" response of how the simple phytoplankton model would behave. This allowed us to view which of the models most closely reproduced the underlying intrinsic relationships of the simple phytoplankton model.

170
This method of sensitivity analysis is in contrast to partial dependence plots (PDPs), which are commonly used in Using the predictions produced from the sensitivity analyses, we also computed the half-saturation constants for each curve. Using the Matlab function "fitnlm," the half-saturation constants were determined by fitting a non-linear 180 regression model to each sensitivity analysis curve matching the form of a Michaelis-Menten curve (Eq. 3): where B corresponds to the biomass predictions from the sensitivity analyses, N represents the nutrient concentrations from the sensitivity analyses, and α1 and α2 are the constants that are being estimated by the nonlinear regression model. α2 was taken as the estimation of the half-saturation coefficient for each sensitivity analysis 185 curve.

Scenario 2: Intrinsic and apparent relationships on different timescales
In Scenario 1, the intrinsic and apparent relationships differed only by a scale factor and operated at the same time and spatial scale. However, in reality, input variables (such as light) vary on hourly time scales, while satellite 190 observations and ESM model output are often only available on monthly-averaged timescales. So the reality is that even if a system is controlled by intrinsic relationships, the apparent relationships gained from climatological variables on long timescales will not reproduce these intrinsic relationships since the average light (irradiance) limitation is not equal to the limitation given the averaged light value (Eq. 4).
where the overbar denotes a time-average, and Irr stands for irradiance (light). We wanted to investigate how such time averaging biased our estimation of the intrinsic relationships from the apparent ones; i.e., how does the link between the intrinsic and apparent relationships change with different amounts of averaging over time?
For the short timescale intrinsic relationships, we took daily inputs for the three predictor variables for one year from 200 the BLING model. We further reduced the timescale from days to hours to introduce daily variability for the irradiance variable relative to the latitude, longitude, and time of year (Eq. 5).
where IrrInt is the hourly interpolated value of irradiance, Irrdaily is the daily-mean value of irradiance, t is the hour of the day being interpolated, tSunrise is the hour of sunrise, and TDay is the total length of the day. The resulting curve 205 preserves the day to day variation in the daily mean irradiance due to clouds but allows a realistic variation over the course of the day. The hourly values for the micronutrient and macronutrient were assigned using a standard interpolation between each of the daily values. These hourly interpolated values were then used to calculate the hourly biomass from Eq. 1 and 2. Note that we are not claiming the biomass itself would be zero at night but assume that on a long enough timescale, it should approach the average of the hourly biomass. 210 To simulate apparent relationships, we smoothed the hourly values for both biomass and the input variables into daily, weekly, and monthly averages for each lat/lon point. To reiterate, the intrinsic and apparent relationships in Scenario 2 differed in timescales, but not in spatial scales. Each dataset was then analyzed following steps similar to those outlined in Scenario 1; constructing training and testing subsets, using the same variables for input to predict 215 the output (biomass), and using the same ML methods. To assess each method's performance, we calculated the R 2 value, MSE, and RMSE between the predictions and observations for the training and testing subsets. We also performed a sensitivity analysis and calculated half-saturation constants similar to those described above.

Scenario 3: BLING biogeochemical model 220
As a demonstration of their capabilities, the ML methods were also applied directly to monthly averaged output from the BLING model itself using the same predictors in Scenarios 1 and 2, but using the biomass calculated from the actual BLING model. As described in Galbraith et al. (2010), BLING is a biogeochemical model where biomass is diagnosed as a non-linear function of the growth rate smoothed in time. The growth rates, in turn, have the form is that the light limitation term is calculated using a variable Chl:C ratio following the theory of Geider et al. (1997). 230 The variation of the Chl:C ratio would correspond to a in Scenarios 1 and 2 which adjusts in response to both changes in irradiance (if nutrient is low) or changes in nutrient (if irradiance is high) as well as changes in temperature. Given the resulting growth rate , the total biomass then asymptotes towards where is a grazing rate, the tilde denotes an average over a few days and * is just the biomass constant that we 235 saw in the previous two scenarios. Growth rates and biomass are then combined to drive the uptake and watercolumn cycling of micronutrient and macronutrient within a coarse-resolution version of the GFDL ESM2M fully coupled model (Galbraith et al., 2011), denoted as ESM2Mc.
As described in Galbraith et al. (2011) and Bahl et al. (2019), ESM2Mc produces relatively realistic spatial 240 distributions of nutrients, oxygen, and radiocarbon. Although simpler in its configuration relative to models such as TOPAZ (Tracers of Ocean Productivity with Allometric Zooplankton; Dunne et al., 2013), it has been demonstrated that in a higher-resolution physical model BLING produces simulations of mean nutrients, anthropogenic carbon uptake, and oceanic deoxygenation under global warming that are almost identical to such complicated models (Galbraith et al., 2015). 245 We chose to use BLING for three main reasons. The first is that we know it produces robust apparent relationships between nutrients, light, and biomass by construction -although these relationships can be relatively complicatedparticularly insofar as iron and light colimitation is involved (Galbraith et al., 2010). As such, it represents a reasonable challenge for an ML method to recover such non-linear relationships. The second is that we know how 250 these relationships are determined by the underlying intrinsic relationships between limiting factors and growth.
Models with more complicated ecosystems (including explicit zooplankton and grazing interactions between functional groups) may exhibit more complicated time-dependence that would confuse such a straightforward linkage between phytoplankton growth limitation and biomass. The third is that despite its simplicity, the model has relatively realistic annual mean distributions of surface nutrients, iron, and chlorophyll, and under global warming, it 255 simulates changes in oxygen and anthropogenic carbon uptake that are similar to much more complicated ESMs (Galbraith et al., 2015).

ML Algorithms
We chose to use Random Forests (RFs) and Neural Network Ensembles (NNEs) in this manuscript because they are 260 two of the more popular ML algorithms. Although other ML methods exist, the list of possible choices is rather long. With the main purpose of this paper being to examine the link between intrinsic and apparent relationships on https://doi.org/10.5194/bg-2020-262 Preprint. Discussion started: 22 July 2020 c Author(s) 2020. CC BY 4.0 License. different time and spatial scales, it was decided that the number of ML algorithms being compared would be limited to RFs and NNEs given their popularity in studying ecological systems. The results of the ML methods were compared against Multiple Linear Regression (MLR) to demonstrate the better performance of ML as compared to 265 more conventional empirical methods. Although the stronger performance of ML may seem clear to experienced ML experts, it was not immediately evident to us since we previously had little experience with ML. Therefore, MLR is included here for demonstrative purposes for less experienced ML users.
It should be noted that we are not trying to suggest that MLR is always ineffective for studying ecological systems. 270 MLR is a very useful and informative approach for studying linear relationships within marine ecological systems (Chase et al., 2007;Harding et al., 2015;Kruk et al., 2011). However, we highly encourage our readers to try ML as it can provide insight into the non-linear portions of a dataset.

Random Forests 275
RFs are an ensemble ML method utilizing a large number of decision trees to turn "weak learners" into a single "strong learner" by averaging multiple outputs (Breiman, 2001). In general, RFs work by sampling (with replacement) about two-thirds of a dataset and constructing a decision tree. At each split, the random forest takes a random subset of the predictors and examines which variable can be used to split a given set of points into two maximally distinct groups. This use of random predictor subsets helps to ensure the model is not overfitting the data. 280 The process of splitting the data is repeated until an optimal tree is constructed or until the stopping criteria are met, such as a set number of observations in every branch (then called a leaf / final node). The process of constructing a tree is then repeated a specified number of times, which results in a group (i.e., "forest") of decision trees. Random forests can also be used to construct regression trees in which a new set of observations traverse each decision tree with its associated predictor values and the result from each tree is aggregated into an averaged value. 285 Here, we used the same parameters for RF in the three scenarios to allow for a direct comparison between the scenarios and to minimize the possible avenues for errors. Each RF scenario was implemented using the TreeBagger function in MATLAB 2019b, where 500 decision trees were constructed with each terminal node resulting in a minimum of five observations per node. An optimization was performed to decide the number of decision trees that 290 minimized the error while still having a relatively short runtime of only several minutes. For reproducible results, the random number generator was set to "twister" with an integer of "123". Any remaining options were left to their default values in the TreeBagger function.

Neural Networks 295
NNs are another type of ML that has become increasingly popular in ecological applications (Flombaum et al., 2020;Franceschini et al., 2019;Guégan et al., 1998;Lek et al., 1996aLek et al., , 1996bMattei et al., 2018;Olden, 2000;Özesmi and Özesmi, 1999;Scardi, 1996Scardi, , 2001Scardi and Harding, 1999). Scardi (1996) used NNs to model phytoplankton primary production in the Chesapeake and Delaware Bays. Lek et al. (1996a) demonstrated the ability of NNs to explain trout abundance using several environmental variables through the use of the "profiling" 300 method, a type of variable importance metric that averages the results of multiple sensitivity analyses to acquire the importance of each variable across its range of values.
Feed-forward NNs consist of nodes connected by synapses (or weights) and biases with one input layer, (usually) at least one hidden layer, and one output layer. The nodes of the input layer correspond to the input values of the 305 predictor variables, and the hidden and output layer nodes each contain an "activation function." Each node from one layer is connected to all other nodes before and after it. The values from the input layer are transformed by the weights and biases connecting the input layer to the hidden layer, put through the activation function of the hidden layer, modified by the weights and biases connecting the hidden layer to the output layer, and finally entered into the final activation function of the output node. 310 The output (predictions) from this forward pass through the network is compared to the actual values, and the error is calculated. This error is then used to update the weights with a backward pass through the network using backpropagation. The process is repeated a specified number of times or until some optimal stopping criteria are met, such as error minimization or validation checks where the error has increased a specified number of times. For 315 a more in-depth discussion of NNs, see Schmidhuber (2015).
For this particular study, we use neural network ensembles (NNEs), which are a collection of NNs whose predictions are averaged into a single prediction. It has been demonstrated that NNEs can outperform single NNs and increase the performance of a model by reducing the generalization error (Hansen and Salamon, 1990). 320 To minimize the differences between scenarios, we used the same framework for the NNs in each scenario. Each NN consisted of three input nodes (one for each of the predictor variables), 25 nodes in the hidden layer, and one output node. The activation function within the hidden nodes was a hyperbolic tangent sigmoid function and the activation function within the output node used a linear function. The stopping criteria for each NN was set as a 325 validation check such that the training stopped when the error between the predictions and observations increased for six consecutive epochs. An optimization was performed to decide the number of nodes in the hidden layer that https://doi.org/10.5194/bg-2020-262 Preprint. Discussion started: 22 July 2020 c Author(s) 2020. CC BY 4.0 License. minimized the error while maintaining a short training time. Additionally, sensitivity analyses were performed using different activation functions to ensure the choice of activation function had minimal effect on the outcome and apparent relationships found by the NNEs. 330 Each NNE scenario used the feedforwardnet function in MATLAB 2019b. Any options not previously specified remained at their default values in the feedforwardnet function. The NNEs contained ten individual NNs for each scenario. For reproducibility, the random number generator was set to "twister," and the random number seed was set to the respective number of its NN (i.e., 1, 2, 3, up to 10). 335 Each variable was scaled between -1 and 1 based on its respective maximum and minimum. This step ensures that no values are too close to the limits of the hyperbolic tangent sigmoid activation function, which would significantly increase the training time of each NN. These scalings were also applied to the RF and MLR methods for consistency between methods and the scaling did not affect the results of either method (results not shown). The results 340 presented in this paper were then transformed back to their original scales to avoid confusion from scaling.

Scenario 1: Intrinsic and apparent relationships on the same timescale
In Scenario 1, the RF and NNE both outperformed the MLR as demonstrated by higher R 2 values, lower MSE, and 345 lower RMSE (Table 1). The decreased performance of the MLR is not inherently surprising, given the non-linearity of the underlying model, but it does demonstrate that the range of nutrients and light produced as inputs by ESM2Mc is capable of producing a non-linear response. Additionally, each method showed similar performances between the training and testing subsets suggesting adequate capture of the model dynamics in both subsets.

350
From the spatial distributions of the true response and the predictions from each method, it can be observed that the RF and NNE showed the closest agreement with the true response (Fig. 1). Although MLR was able to reproduce the general trend of the highest biomass in the low latitudes and low biomass in the high latitudes, it was not able to predict higher biomass values.

355
In addition to examining whether the different ML methods got the "right" answer, we also interrogated these methods to look at how different predictors contributed to the answer, and whether these contributions matched the intrinsic relationships between the predictors and growth rate as we had put into the model. The MLR (red dashed https://doi.org/10.5194/bg-2020-262 Preprint. Discussion started: 22 July 2020 c Author(s) 2020. CC BY 4.0 License. lines) shows very little response to changes in macronutrient (left column), an unrealistic negative response to increases in micronutrient (central column), and a reasonable (albeit linear) match to the light response (right 360 column). By contrast, the response to any predictor for the NNE (green dashed lines) showed agreement with the true response of the model (black lines) in all circumstances insofar as the true response was always within the standard deviation of the NNE predictions. The RF prediction of the response to a given predictor (blue dashed lines) showed agreement with the true response when the other predictors are fixed at the lower percentiles (top two rows) but began deviating in the higher percentiles. 365 When we computed an "effective" half-saturation for the nutrient curves in the top row of Fig. 2, we got values for that were far lower than the actual ones specified in the model (Table 4). The "effective" half-saturation of when other predictors are held at their 25 th percentile for the micro-and macronutrient were underestimated by one and two orders of magnitude, respectively. It was only at the higher percentiles that the micronutrient "effective" half-370 saturation was adequately captured when the macronutrient was not limiting. Furthermore, the "effective" halfsaturation of the macronutrient was not captured even when the other variables were held at their 75 th percentiles because the 75 th percentile of the micronutrient still limited growth.

Scenario 2: Intrinsic and apparent relationships on different timescales 375
As in Scenario 1, the RF and NNE outperformed the MLR based on the performance metrics for the daily, weekly, and monthly time-averaged scenarios ( Table 2). The comparable performances between the training and testing subsets suggest a sufficient sampling of the data for each method to capture the dynamics of the underlying model.
Examining the monthly apparent relationships found for each method and comparing them to the true intrinsic 380 relationships shows that none of the methods were able to reproduce the true intrinsic relationships, with one exception being the 25 th percentile plot of the micronutrient (Fig. 3). This result was consistent across the different timescales, and the sensitivity analysis showed little difference in the predicted relationships between the daily, weekly, and monthly averaged timescales for the NNEs (Fig. 4). Interestingly, the NNE and RF appeared to asymptote near the proper concentration for the micro-and macronutrients (Fig. 3). For example, the true response 385 of the macronutrient has a sharp asymptote at low concentrations, and the NNE and RF appear to mimic this asymptote, even though the predicted biomass concentration is lower than the true biomass (Fig. 3). Furthermore, the ML methods were able to mimic the non-linearity of the system, which is an important result regardless.
When the "effective" half-saturation constants were computed for the daily, weekly, and monthly NNEs, many of 390 the light and micronutrient half-saturations were of the same magnitude as the true value (Table 4). This is an interesting result given that the predicted biomass concentrations were much lower than the true response.

Scenario 3: BLING biogeochemical model
When run in the full ESM, the BLING biogeochemistry does end up producing surface biomass, which is a strong 395 function of the growth rate (Fig. 6a) with a non-linear relationship as in Eq. 7. As the growth rate, in turn, is given by Eq. 6, we can also examine how the monthly mean limitation terms for nutrient and light compare with the means given by computing the limitations with monthly mean values of nutrients, , and . As shown in Fig. 6b, the nutrient limitation is relatively well captured using the monthly mean values, although there is a tendency for the monthly means to underestimate moderate values of nutrient limitation. Further analysis shows that this is due to the 400 interaction between micro-and macro-nutrient limitation -with the average of the minimum limitation being somewhat higher than the minimum of the average limitation. However, using the actual monthly mean values of , and (Fig. 6c) causes the light limitation to be systematically biased high.
When MLR and ML were applied to the output of one of the BLING simulations, the RF and NNE again 405 outperformed the MLR in all of the performance metrics for the training and testing subsets ( Table 3). The RF performed slightly better than the NNE (R 2 of 0.973 vs. 0.942) on the training subset, but this difference was lessened in the testing subset (R 2 of 0.945 vs. 0.939). Although there were slight differences in the RF performance between the training and testing subsets, the values of the performance metrics were of the same magnitude. The similar performance for each method across the training and testing subset expresses the adequate capture of the 410 dataset's variability.
The sensitivity analysis shows the biomass continues to increase with an eventual asymptote even in the 75 th percentile plots (Fig. 7). However, the NNE curve for biomass is strongly hindered in the light and macronutrient plots even at higher percentiles, while large increases are observed in the micronutrient plots when light and 415 macronutrient are at higher concentrations.

Scenario 1: Intrinsic and apparent relationships on the same timescale
In the first scenario, our main objective was to determine if ML methods could extract intrinsic relationships when 420 given information on the apparent relationships and reasonable spatiotemporal distributions of colimitation when the intrinsic and apparent relationships were operating on the same timescale.
Despite the fact that it agreed well with the observations, the RF prediction deviated from the true response to a given variable when other variables are held at higher percentiles (Fig. 2). This can likely be explained by the range 425 of the training subset and how RFs acquire their predictions. When presented with predictor information, RFs rely on the information contained within their training data. If they are presented with predictor information that goes outside the range of the dataspace of the training set, RFs will provide a prediction based on the range of the training set. When performing the sensitivity analysis, the values of the predictors in the higher percentiles were probably outside the range of the training subset. For example, the bottom left plot of Fig. 2 shows how RF deviates from the 430 true response as the concentration of the macronutrient increases -actually decreasing as nutrient increases despite the fact that such a result is not programmed into the underlying model. Although there may be observations in the training subset where the light and micronutrient are at their 75 th percentile values when the macronutrient is low, there likely are not any observations where high levels of the macronutrient, micronutrient, and light are cooccurring. Without any observations meeting that criteria, the RF provided the highest prediction it could based on 435 the training information. We discuss this point in more detail below.
In contrast to the RF's inability to extrapolate outside the training range, the NNE showed its capability to make predictions on observations on which it was not trained (Fig. 2). Note, however, that while we have programmed Michaelis-Menten intrinsic dependencies for individual limitations into our model, we do not get Michaelis-Menten 440 type curves back for macro-and micronutrients when the other variables were set at low percentiles. The reason is that Liebig's law of the minimum applies to the two nutrient limitations so that when the micronutrient is low, it prevents the entire Michaelis-Menten curve for the macronutrient from being seen.
When the "effective" half-saturation was computed for the macro-and micronutrient curves in Fig. 2, they were far 445 lower than the true values in the lower percentiles because of colimitations between the macro-and micronutrients (Table 4). While mathematically obvious, this result has implications for attempts to extract (and interpret) KN from observational datasets, such that one would expect colimitation to produce a systematic underestimation of KN. https://doi.org/10.5194/bg-2020-262 Preprint. Discussion started: 22 July 2020 c Author(s) 2020. CC BY 4.0 License.
With respect to our main objective for Scenario 1, it was evident that only the NNE was able to extract the intrinsic 450 relationships from information on the apparent relationships. This was due in large part to its capability of extrapolating outside the range of the training dataset, whereas RFs were constrained by training data, and MLR was limited by its inherent linearity and simplicity.

Scenario 2: Intrinsic and apparent relationships on different timescales 455
In Scenario 1, the intrinsic and apparent relationships were simply related by a scaling factor. In practice, the relationships are more difficult to connect to each other. For the second scenario, both the output biomass and predictors (light, macronutrient, and micronutrient) were averaged over daily, weekly, and monthly timescales. Our main objective was to investigate how the link between intrinsic and apparent relationships changed when using climatologically averaged data -as is generally the case for observational studies. 460 When comparing the apparent relationships of the time-averaged datasets with those of the hourly intrinsic relationships, the methods almost always underestimated the true response to light and nutrient ( Fig. 3 and 4). This result is not entirely unexpected. The averaging of the hourly values into daily, weekly, and monthly timescales quickly leads to a loss of variability, especially for light (Fig. 5). In fact, the variability was lost in the daily time 465 averaging with the longer timescales showing only small differences in the possible range of values (Fig. 5). The loss of variability means that the light limitation computed from the averaged light is systematically higher than the averaged light limitation. To match the observed biomass, the asymptotic biomass at high light has to be systematically lower (see Appendix A for the mathematical proof). Differences were much smaller for nutrients as they varied much less over the course of a month in our dataset. Our results emphasize that when comparing 470 apparent relationships in the environment to intrinsic relationships from the laboratory, it is essential to take into account which timescales of variability averaging has removed. Insofar as most variability is at hourly time scales, daily-, weekly-, and monthly-averaged data will produce very similar apparent relationships (Fig. 4). But if there was a strong week-to-week variability in some predictor, this may not be the case.

475
Although the ML methods were unable to reproduce the intrinsic relationships, they were able to model the general trend of the relationships (i.e., higher concentrations of each predictor lead to higher biomass; eventual asymptotes in the macro-and micronutrient). Additionally, the NNE and RF appeared to asymptote at the same nutrient concentrations as that of the true response (Fig. 3). This type of result can help to answer questions such as: which nutrients have the greatest impact on biomass when other nutrients change? This effectively allows one to examine 480 the interactions between variables. https://doi.org/10.5194/bg-2020-262 Preprint. Discussion started: 22 July 2020 c Author(s) 2020. CC BY 4.0 License.
The computed "effective" half-saturation constants were interestingly of the same magnitude as the true value (Table 4). This is a clear demonstration of the potential hazards one may face when inferring KN from observational datasets, as mentioned previously in Scenario 1. A further implication from Scenario 1 is reinforced in the 485 computation of the "effective" half-saturation of the macronutrient, such that it is underestimated by an order of magnitude relative to the true value because of micronutrient limitation (Table 4).

Scenario 3: BLING biogeochemical model
To demonstrate their capabilities, each method was also applied directly to the monthly averaged output of one of 490 the BLING simulations. The main purpose of the final scenario was to demonstrate the capabilities of the ML methods when applied to actual ESM output with the reasoning that if the ML methods were unable to provide useful information on BLING, they would also fail on more complex models.
The large increases in biomass in the micronutrient plots and hindrance of biomass in the light and macronutrient 495 plots suggest that the system is limited by the concentration of micronutrient (Fig. 7). The biomass remained low even when macronutrient and light were at favorable levels because even when at the 75 th percentile value, the micronutrient was still limiting (Fig. 8). Conceptually this makes sense since the micronutrient limitation in the BLING model hinders growth, but also limits the efficiency of light-harvesting (Galbraith et al., 2010).
Additionally, the computation of the "effective" half-saturation constants demonstrates that the half-saturation 500 constant for light drops sharply as nutrients drop (Table 4).

Conclusions
Our main objective in this manuscript was to use ML to determine under what conditions intrinsic and apparent relationships between phytoplankton are no longer equal, to identify whether such divergence depends on the ML 505 method or how the input data is handled, and to understand how such divergence is related to underlying biological dynamics.
In Scenario 1, we demonstrated that NNEs were capable of extracting the intrinsic non-linear relationships from the apparent relationships when apparent and intrinsic relationships were operating on the same timescale and when 510 they were linearly related by a scaling factor. However, this relationship broke down in Scenario 2, when timeaveraging caused a systematic overestimate of light limitation. We note that while Scenario 2 illustrates that the ability to recover the intrinsic relationship with light may be compromised by temporal averaging, spatial averaging could have a similar impact. If, for example, we imagine coastal regions in which nutrient delivery is very patchy, a spatially averaged relationship between biomass and nutrient may also show similar biases. So it appears that the 515 extent to which ML methods can extract the intrinsic relationships depends on the extent to which the variability of https://doi.org/10.5194/bg-2020-262 Preprint. Discussion started: 22 July 2020 c Author(s) 2020. CC BY 4.0 License. the system is captured; i.e., more coverage of the parameter space at higher temporal resolution would yield more accurate estimates of the intrinsic relationships.
Although RFs and NNEs were unable to extract the exact intrinsic relationships due to time-averaging, they were 520 able to model the general trend of the relationships in Scenario 2. This mimicking of the non-linear relationships can still be a valuable tool for examining a dataset, in that one can assess which combinations of nutrients most affect the biomass and can get a relative estimate of the uncertainty in the prediction; effectively, allowing one to examine the interactions between variables and their effect on the outcome. This was further demonstrated in Scenario 3 when it was observed that even at high concentrations of light and macronutrient, biomass was limited by the 525 concentration of micronutrient. This observation was not immediately expected or evident to us when we applied these methods to the BLING model. Similar insights might be found in other ESM output or observational datasets.
In addition to climatological averaging, it was also observed that colimitation could affect the apparent relationships found by ML. In each Scenario, we observed instances where biomass was low even when the concentrations of one 530 of the drivers were high. This was due to one of the other drivers being limiting. Had we not known what the true intrinsic relationships were, it may have appeared that the ML methods were producing unrealistic results. For example, if the real world behaved like the right-hand column of Fig. 7, we might conclude that phytoplankton were strongly photo-inhibited, even though our results with BLING (which does not have explicit photoinhibition) demonstrate that this is not a necessary conclusion. This demonstrates the caution one must take in interpreting these 535 kinds of systems.
Both RFs and NNEs performed well when the predictions they were asked to make were within the range of the training data. However, the sensitivity analyses illustrated the impact of RFs inability to extrapolate outside that range and that RF's suggested systematic decreases in biomass at high values of a limiting variable. Nonetheless, 540 RFs were able to capture the same relationships as the NNEs when the sensitivity analysis was querying environments within the range of the training data. It seems that as long as RFs are presented with information across the range of the dataset, RFs will perform just as well as NNEs in a sensitivity analysis. This strengthens the conclusions of Rivero-Calle et al. (2015) in that physiologically reasonable relationships between forcing variables and biomass found using RF are reliable so long as the forcing variables (in this case pCO2 and temperature) vary 545 over their entire range independently of other variables (nutrients and light). However, when variation in pCO2 is related to variation in nutrients and light (i.e., in the seasonal climatology where pCO2 is high in the winter, light is low, and nutrients are high) RFs are unable to extract a clear signal of pCO2 limitation. https://doi.org/10.5194/bg-2020-262 Preprint. Discussion started: 22 July 2020 c Author(s) 2020. CC BY 4.0 License. This paper examined two of the more popular ML algorithms, but many other methods exist as well. Future research 550 should attempt to use some of the other methods to see how they perform. However, one of the main takeaways would likely be the same regardless of the ML method; the training data should contain sufficient coverage of the range of forcing and the spatiotemporal variability within a system in order to capture the intrinsic relationships.
This paper also limited the number of predictor variables for each scenario so that the sensitivity analyses could be 555 easily visualized. In the real world, phytoplankton may be limited by more physical and biological processes, making the visualization of the sensitivity analyses impractical due to the sheer number of possible interactions that would have to be considered. In cases such as those, it would be beneficial to perform some form of importance analysis or dimensionality reduction to remove insignificant predictor variables, after which sensitivity analyses could be done on the remaining predictors. 560 ML techniques have several benefits that could make them useful for biological oceanographers and ecosystem modelers. Many ML methods (including the two presented here) do not require any prior knowledge of a system to construct a model. Additionally, new methods are continually being developed for viewing the dynamics of the ML models. Given these advantages, ML could provide a compact form for representing relationships between 565 ecosystem parameters such as biomass and primary productivity and their environmental drivers (nutrients and light) in observational data and complex models. Preliminary work indicates that we can use NNEs in particular to: 1.
Compare model relationships with those derived from observational datasets, rather than simply using spatial patterns of errors. 2. Evaluate whether differences between models reflect important differences in biological parameters or whether they are due to differences in the physical circulation. We would expect that two different 570 physical models run with the same biological scheme would produce the same relationships. 3. Evaluating whether global warming really would be expected to drive ecosystems outside their historical parameter range. We will report on these results in a future manuscript.
so that if we are trying to fit a curve of the form We would expect that * < * . 585

Code and Data Availability
The Matlab scripts for the construction of the figures and tables, the scripts for training and testing the MLR, RF, and NNE algorithms, and the source files for each scenario are available in the Zenodo data repository (https://doi.org/10.5281/zenodo.3932388, Holder and Gnanadesikan, 2020). 590

Author Contribution
CH implemented the ML algorithms, analyzed the results for each scenario, and wrote the majority of the manuscript. AG helped in developing the simple phytoplankton models for Scenarios 1 and 2, provided the biogeochemical model output used in Scenario 3, and helped in the analysis of the results.

Competing Interest 595
The authors declare that they have no conflicts of interest.