Using Simple, Explainable Neural Networks to Predict the Madden-Julian Oscillation

Few studies have utilized machine learning techniques to predict or understand the Madden-Julian oscillation (MJO), a key source of subseasonal variability and predictability. Here, we present a simple framework for real-time MJO prediction using shallow artificial neural networks (ANNs). We construct two ANN architectures, one deterministic and one probabilistic, that predict a real-time MJO index using maps of tropical variables. These ANNs make skillful MJO predictions out to ∼18 days in October-March and ∼11 days in April-September, outperforming conventional linear models and efficiently capturing aspects of MJO predictability found in more complex, dynamical models. The flexibility and explainability of simple ANN frameworks are highlighted through varying model input and applying ANN explainability techniques that reveal sources and regions important for ANN prediction skill. The accessibility, performance, and efficiency of this simple machine learning framework is more broadly applicable to predict and understand other Earth system phenomena.

2 of 17 study the MJO may thus improve the ability to forecast the oscillation or related S2S processes (e.g., Mayer & Barnes, 2021).
Studies using ML to study the MJO have identified the MJO (Toms et al., 2019), reconstructed past MJO behavior (Dasgupta et al., 2020), or bias-corrected dynamical model output of MJO indices (H. Kim et al., 2021), but only a few studies have examined MJO prediction solely using ML (Hagos et al., 2021;Love & Matthews, 2009). It is thus timely to establish ML frameworks for predicting the MJO and quantify ML model performance compared to other statistical and dynamical models. A further goal of this study is to demonstrate how simple ML models may be used for more than just prediction. While prediction skill is an undeniably important metric for model performance, simple ML models are also flexible tools that invite experimentation and can inform physical understanding of climate processes like the MJO. We highlight this under-appreciated aspect of ML modeling here through experiments changing model input, through exploration of both deterministic and probabilistic ML model architectures, and through application of tools from the field of explainable AI (XAI; Mamalakis et al., 2021;McGovern et al., 2019;Toms et al., 2020). This paper addresses three aspects of using ML to study the MJO: (a) developing ML frameworks, (b) analyzing ML model performance, and (c) demonstrating how ML can inform scientific understanding. We prioritize simple techniques (i.e., shallow, fully-connected artificial neural networks [ANNs]) to establish a benchmark for future ML modeling, to ensure our approach is broadly accessible to the climate community, and to facilitate applying XAI tools. We view this work as a starting point upon which future ML studies focused on the MJO may build. Further, the concept and methods we describe are widely transferable to other areas in Earth science, and may help inform simple ML modeling of other climate phenomena. Section 2 describes the data used in this study. Section 3 describes the ANN models, an ANN explainability method, the linear models we compare the ANN to, and how model skill is assessed. Section 4 describes our results, and Section 5 provides a summary and conclusion.

Data
The predictors of our ANN models are latitude-longitude maps of processed tropical variables from 20°N to 20°S. The predictand is the observed Real-time Multivariate MJO index ("RMM"; Wheeler & Hendon, 2004) which tracks the MJO using an empirical orthogonal function analysis of outgoing longwave radiation (OLR), and zonal wind at 850 and 200 hPa. The index consists of two time series ("RMM1" and "RMM2") that represent the strength and location of the MJO. Plotted on a 2-D plane, the RMM phase angle describes the location, or "phase," of the MJO (e.g., Figure 1), while the RMM amplitude ( √ RMM1 2 +RMM2 2 ) measures MJO strength. RMM has known limitations (Roundy et al., 2009;Straub, 2013) and other MJO indices exist (e.g., Kikuchi et al., 2012;Kiladis et al., 2014;Ventrice et al., 2013), but RMM represents a logical starting point in this work as it is a widely-used, benchmark MJO index suitable for real-time forecasts.
The tropical input data are from three sources: OLR is from the NOAA Interpolated OLR data set (Liebmann & Smith, 1996), sea-surface temperature (SST) is from the NOAA OI SST V2 High Resolution data set (Reynolds et al., 2007), and all other variables are from ERA-5 reanalysis (Hersbach et al., 2020). Additional data from the ERA-20C data set (Poli et al., 2016) is used in the Supporting Information S1, as described therein. We use daily mean data from 1 January 1979 (1982 for SST) to 31 December 2019 that are interpolated onto a common 2.5° × 2.5° grid.
The input data are divided into training, validation, and testing periods. Training data is used to find the weights/ coefficients of the statistical models presented below, validation data is used when tuning model performance, and test data is set aside until the final models are settled upon. Here the training period is from 1 June 1979 to 31 December 2009; the validation data is from 1 January 2010 to 31 December 2015; and the testing is from 1 January 2016 to 30 November 2019. Results assessing the model performance discussed below are evaluated over the testing period, or in a few instances the testing and validation data are used together to increase the sample size, where explicitly noted.
ANN input data are pre-processed in a similar way to the RMM input variables (Wheeler & Hendon, 2004). We subtract the daily climatological mean, first three seasonal-cycle harmonics, and a previous 120-day mean from each point. Variables are not averaged latitudinally because we are interested in how the 2-D structure is utilized 3 of 17 by the ANNs (sensitivity tests exploring latitudinal averaging are discussed in Supporting Information S1). We normalize each variable by subtracting the tropics-wide, all-time mean and dividing by the tropics-wide, all-time standard deviation at each grid point. Tests normalizing each grid point individually showed similar results (not shown). Note that all pre-processing steps (e.g., calculating the climatology, seasonal cycle, and normalization values) are computed only using the training data period, to avoid leakage of information from the validation and testing data into the training data.
In Section 4, the sensitivity of the model to the phase of the stratospheric quasi-biennial oscillation (QBO; Baldwin et al., 2001;Ebdon, 1960;Reed et al., 1961) is assessed. We define the QBO using the monthly, 10°N/S-mean, zonal-mean zonal wind at 50 hPa (U50). Months where U50 is less than the mean minus half a standard deviation are defined as QBO easterly, and months greater than half a standard deviation from the mean are QBO westerly (e.g., Son et al., 2017;Yoo & Son, 2016). Only the November-March period was considered, as a QBO-MJO link is only observed over those months (Martin et al., 2021;Yoo & Son, 2016). To assess whether changes in MJO prediction skill in different QBO phases are statistically significant, we use a bootstrapping test. From all available November-March months in the validation and testing period, we randomly resampled two subsets of months with replacement that have the same size as the observed QBOE and QBOW periods (8 and 12 months, respectively). We compute the change in prediction skill between these two random subsets at each lead time, repeating the process 1,000 times. From that distribution, we assess significance at the 95% confidence level, that is, when the difference between actual QBOE and QBOW periods is less than the 2.5 percentile, or greater than the 97.5 percentile.

Machine Learning and Linear Statistical MJO Models
We first discuss the two types of ANNs and an ANN explainability technique used in this study. We then describe three conventional statistical MJO models used in prior studies (Jiang et al., 2008;Kang & Kim, 2010;Maharaj & Wheeler, 2005;Marshall et al., 2016) that we compare to the ANNs. We conclude with a brief discussion of how model forecasts are evaluated.

ANN Input, Output, and Architecture
We explored two ANN architectures to study the MJO: a "regression model" and a "classification model" (see summary schematic Figure 1). Both ANN architectures input processed latitude-longitude maps from a single day, and output information about the RMM index N days into the future ( Figure 1). Note that inputting tropical maps into the ANN is distinct from the majority of statistical MJO models, which typically input values of the RMM index or a limited number of principal components (Jiang et al., 2008;Kang & Kim, 2010;Waliser, 2012). Using the ANNs in the present manner allows the 2-dimensional structure of a range of different combinations of input variables to be used in the model. In this work we focus on ANNs that input between 1 and 3 different variables. In particular, in this section and Section 4.1 we use ANNs that input three variables simultaneously: OLR, zonal wind at 850 hPa, and zonal wind at 200 hPa ( Figure 1). This combination is among the best-performing across the experiments we conducted and uses the variables that comprise RMM. Exploration of other variables is described in more detail in Section 4.2.
For both regression and classification ANN architectures, a separate ANN is trained for each lead time N from 0 to 20 days. The difference between the regression and classification ANNs is the nature of their outputs. The regression ANN (not to be confused with a linear regression model) outputs RMM1 and RMM2 values (i.e., a vector of two real numbers). Examples of regression ANN output are shown in Figures 1a and 2. Figure 1a shows an example prediction in RMM phase space for a 20-day forecast in the ANN compared to observations. Figure 2 shows lead 0, 5, and 10-day predictions on each day over a particular winter period for RMM1 and RMM2.
In contrast to the regression model, which is deterministic, the classification ANN provides probabilistic forecasts. The classification ANN outputs the probability that the MJO at a given lead time is in each of nine classes (e.g., Figures 1b and 3): either active (RMM amplitude ≥1) in one of the eight canonical RMM phases (Wheeler & Hendon, 2004) or weak ("phase 0"; RMM amplitude <1). The class with the highest probability is considered 5 of 17 the predicted class. An example of the classification ANN output for one initialization date at four different lead times is shown in Figure 3 alongside the observed RMM index.
Both the regression and classification ANNs are simple, shallow, fully-connected neural networks. Both architectures have one layer of 16 nodes that use a rectified linear activation function ("ReLU"). For the regression ANN, the loss function is the mean-squared error, while the classification ANN loss function is the categorical cross-entropy, with a softmax operator applied to the output to normalize class probabilities so predictions sum to 1. To help prevent overfitting, both ANN architectures use ridge regularization (an L 2 norm penalty) to limit the weights of the hidden layer. Both architectures also use early-stopping during training, which monitors the loss on the validation data and stops training once the validation loss plateaus (or increases) for a specified number of epochs. The total number of epochs used to train a given ANN varies by model type and lead time. In general, the regression model at short leads tended to train for approximately 150 epochs while at longer leads trained for around 30-40 epochs. For the classification model, training at short leads took approximately 180 epochs, and at longer leads stopped after around 100 epochs.
For the classification ANN, since weak MJO days are the most common class (∼39% of all days) we avoid class imbalance by randomly subsampling weak MJO days during training so they are 11% of all training days. Weak days are not subsampled over the validation or testing periods. Values of key hyperparameters used in both architectures and additional model details are listed in Table 1. Sensitivity tests varying ANN parameters and input The class with the highest probability is considered the prediction; in this example predictions are phase 2 (lead 0; correct), phase 2 (lead 5; correct), phase 3 (lead 10; correct), and phase 4 (lead 15; correct).
6 of 17 data were explored, and while the present configuration was optimal across the tests conducted, results from a subset of our sensitivity tests are discussed in the Supporting Information S1.
ANN performance is slightly improved if the models are trained separately on different seasons ( Figure S1 in Supporting Information S1), which allows the ANNs to learn more season-specific patterns. This is likely important for the MJO due to its seasonal shifts in behavior, strength, and structure (Hendon & Salby, 1994;Hendon et al., 1999;Zhang & Dong, 2004), and we found splitting the data into two 6-month periods (October-March, or herein "winter," and April-September, or "summer") provided a good trade-off between seasonal specificity and number of training samples.
Finally, in some instances we trained multiple ANNs for the same season and lead time, creating an "ANN ensemble." The ANNs in the ensemble are distinct only in the random initial training weights; otherwise the training data and architecture is the same across all ANNs. The ensemble thus allows us to check for convergence of our results during ANN training, and quantifies sensitivity to ANN initialization.

Layer-Wise Relevance Propagation
To demonstrate how the classification ANN correctly captures regions of importance for predicting the MJO, we use an ANN explainability technique called layer-wise relevance propagation (LRP; Bach et al., 2015;Montavon et al., 2019;Samek et al., 2016). LRP has been used in Earth science as a tool for understanding the decision-making process of ANNs Madakumbura et al., 2021;Mamalakis et al., 2021;Mayer & Barnes, 2021;Toms et al., 2019Toms et al., , 2020, and here, we provide a high-level overview.
Broadly, LRP is an algorithm applied to a trained ANN. After a particular prediction is made, LRP back-propagates that prediction's output through the ANN in reverse. Ultimately, LRP returns a vector of the same size as the input (here a latitude-longitude map of one or more variables), where the returned quantity, termed the "relevance," shows which input points were most important in determining that prediction. By construction, LRP relevance maps are unique to each input sample, not each output class.
We use LRP to analyze output from the classification ANN. There are several different implementation rules for LRP, which differ in the details of how they back-propagate information (see Bach et al., 2015;Mamalakis et al., 2021;Montavon et al., 2019;Samek et al., 2016). Based on results in Mamalakis et al. (2021) assessing various implementations of LRP in a synthetic data set, we use the "LRP z " method, which in their case performed well compared to other implementations of LRP. The LRP z method returns both positive and negative relevance values, but because we are interested in regions that positively contribute to correct predictions, we take Note. Sensitivity tests to various aspects of these and other aspects of the ANN models are discussed in the Supporting Information S1. ANN, artificial neural network.

Traditional Linear MJO Models
We compare ANN performance to three established, statistical MJO models: a persistence model, a vector autoregressive (VAR) model, and a multi-linear regression (MLR) model.
The persistence model is often used as a minimal benchmark for statistical MJO model performance, and forecasts RMM1 and RMM2 values by persisting the initial condition. For a forecast beginning at time t 0 , at each lead time τ the persistence model forecasts: While useful as a benchmark of statistical MJO models, the persistence model is unrealistic, in that it neglects the well-understood propagating nature of the MJO. Thus, we focus primarily on comparing the ANN to two more complex statistical models which better account for the propagating nature of the MJO.
The VAR model (Maharaj & Wheeler, 2005;Marshall et al., 2016) is a linear model which inputs RMM values for a given day and predicts RMM values 1 day into the future. Following Maharaj and Wheeler (2005), this is formulated as: L var is a matrix calculated using a multiple linear regression fit from the training data. As with the ANNs, and following Maharaj and Wheeler (2005), we compute L var separately for winter and summer periods using the same training period as the ANNs. Coefficients of L var match closely with those described in the literature (Maharaj & Wheeler, 2005;Marshall et al., 2016), differing slightly due to our different training period and definition of winter and summer. VAR model forecasts are initialized with the observed RMM1/2 values, and then the initial conditions are stepped forward 1 day at a time out to a lead time of 20 days.
The final simple model, the MLR model (Jiang et al., 2008;Kang & Kim, 2010;Wang et al., 2019), generally follows Kang and Kim (2010), who showed across several statistical models that the MLR model performed best at predicting RMM. The model can be written as: L MLR,τ is a matrix of coefficients calculated using a multiple linear regression fit from the training data. The main differences from the VAR model are the MLR model inputs RMM values on the initial day and 1 day prior, and predicts the RMM1/2 values at a specified lead time of τ. As with the ANNs, we train separate MLR models for each lead time and in winter and summer.

Model Assessment Metrics
To assess model skill in the regression ANN, we utilize the bivariate correlation coefficient (BCC; e.g., H. Kim et al., 2018;Vitart et al., 2017), with a value greater than 0.5 used to denote skill. In the classification ANN, skill is measured using the model's accuracy as well as probability-based skill scores. Following Marshall et al. (2016), who examined probabilistic MJO forecasting in a dynamical model framework, we assess skill at predicting MJO phase using the ranked probability skill score (RPSS). We first calculate the ranked probability score (RPS) for a given statistical model for each lead time as: Here, N is the number of forecasts, M is the number of MJO classes (9), p k is the forecast probability in a given MJO class, and o k is the observed probability (i.e., 1 for the observed phase and 0 for all other phases). Following Marshall et al. (2016), we order the m categories from phase 0 to 8, which captures the canonical MJO phase evolution. When the RPS is calculated for the classification ANN, p k is the model confidence for each phase. For the MLR or VAR model, p k is 1 for the predicted phase and 0 otherwise.
We compute a climatological reference RPS, denoted RPS ref , by calculating the percentage of days the observed MJO is in phases 0-8 across the training data, and using those percentages as p k values across all N forecasts. The RPSS for a given model is then computed as: An RPSS greater than 0 indicates a given model shows better skill than climatology.

Overall Model Performance
In this subsection, we use ANNs that input OLR, zonal wind at 850 hPa, and zonal wind at 200 hPa simultaneously ( Figure 1) and initialize forecasts daily.
Overall, the winter and summer regression ANNs show prediction skill, respectively, of ∼18 and ∼11 days (Figure 4), with small spread across a 10-member ANN ensemble. In both seasons, regression ANNs outperform all three of the linear statistical models after 3-4 days in winter and 4-5 days in summer, showing substantially better skill than persistence and modestly better skill than the MLR and VAR models. The ANNs also demonstrate a lower root-mean-square error than other statistical models (Figure 4) indicating that MJO amplitude in both seasons is better captured, though the RMS improvement compared to the MLR model is small. This places simple ANNs at the forefront of statistical MJO prediction techniques, which is impressive given the simplicity of the ANNs and the fact that no explicit information about the RMM index is passed to the networks. The improved performance of the ANN relative to the VAR model further demonstrates that ANNs learn more than just to 9 of 17 identify the MJO in RMM space and then propagate it east. Further, the higher skill in winter versus summer is consistent with results in most dynamical models (e.g., Vitart, 2017), and is one indication that ANNs are able to reproduce aspects of MJO predictability seen in more complex dynamical models. While linear models also show higher skill in winter than summer, the relative increase between the two seasons is larger for the ANN.
To explore regression ANN performance in more detail, we conditioned forecasts based on various aspects of the initial condition, like MJO phase and strength ( Figure 5). For this plot, the sample size was increased by including both the validation and testing data in the analysis. The regression ANN skill shows relatively small sensitivity to initial MJO phase (Figure 5a), with somewhat higher skill (∼18-19 days) across MJO events initialized in phases 1-3 and lower skill (∼14-15 days) for phases 6 and 8. In contrast to the initial phase, the regression ANN shows substantially more sensitivity to initial MJO amplitude: MJO events that are initially strong or very strong (RMM amplitude >1.5) are skillfully predicted out to ∼20 days in winter, while skill predicting weak winter events is only ∼10 days (Figure 5c). This change in skill based on MJO initial condition is consistent with findings in other statistical and dynamical models (H. Kim et al., 2018). Note that one consequence of the low performance predicting weak MJO events is that the regression ANN struggles to successfully predict MJO initiation. Correctly 10 of 17 predicting the onset of the MJO is a recognized challenge in MJO research and forecasting in general (Ling et al., 2017), and in particular improving this aspect of model performance may be a target for future work.
ANNs also capture more mysterious aspects of MJO predictability, such as the sensitivity to the phase of the stratospheric QBO (Marshall et al., 2017;Martin et al., 2021). Studies using both dynamical and statistical models have found improved MJO prediction skill in QBO easterly months compared to QBO westerly months during November-March (NDJFM; H. Kim et al., 2019;Lim et al., 2019;Marshall et al., 2017;Wang et al., 2019). Defining the QBO using the U50 index, the regression ANN skill during QBO easterly NDJFM periods is nearly 20 days, whereas during QBO westerly periods skill is only 15 days (Figure 5d). This modulation is quantitatively consistent with findings in dynamical models (H. Kim et al., 2019;Lim et al., 2019), however, it is important to note the number of QBO cycles is limited since only winters from 2010 to 2019 are considered in Figure 5d.
A bootstrap analysis of the change in prediction skill between QBOE and QBOW periods in Figure 5 nevertheless found that differences are statistically significant out to 14 days, but not at longer lead times. This lack of significance at leads beyond 2 weeks is consistent with other studies that have noted the limited significance of a QBO impact on MJO prediction skill at longer leads (e.g., H. Kim et al., 2019). A similar analysis using the MLR shows a much smaller QBO modulation over the 2010-2019 period, which is not significant at any lead time (Figure 5d). This is different from the findings of Wang et al. (2019), who showed a clear QBO modulation of MJO prediction skill in an MLR model, but their study examined a longer 1979-2019 period, and used a different MJO index. Thus, while the results here highlight the possibility for the ANN model to detect signals that simpler linear models may not capture, a more detailed and focused study of the impact of the stratosphere on the MJO in an ANN modeling framework is needed, especially one that considers a longer period of time. One approach may be to fluctuate training and testing periods to allow consideration of more QBO cycles in the ANN and linear models, a method that could be explored in future work.
Overall, a strength of the regression ANN is the quantitative information it provides about MJO phase and strength. Further, the regression ANN may prove an efficient framework in which to continue to examine aspects of MJO predictability discussed above, like sensitivity to initial MJO amplitude and phase of the QBO. But a prevalent source of error in the regression ANN is a decrease in the ANN-predicted MJO amplitude at lead times longer than a few days, especially in phases 4-7 (Figure 5b). Amplitude bias is also an issue in the VAR and MLR model, and continuing to explore ways in which it might be overcome in an ANN model is an open challenge. However, this amplitude bias was one motivation for exploring a classification ANN architecture that focuses more directly on MJO phase. Further, the probabilistic nature of the classification ANN makes it a unique simple statistical tool for MJO forecasting.
Assessed via model accuracy, a 10-member classification ANN ensemble performs well predicting active MJO events in RMM phases 1-8 (Figure 6), outperforming the MLR and VAR statistical models after approximately 2-3 days, with accuracy during days 7-20 approximately 20% higher (Figure 6; only the MLR model is shown as VAR results are similar). At lead 0, where the classification model is identifying the MJO, the phase of active MJO events are correctly predicted with an accuracy of ∼80% (Figure 6), comparable to Toms et al. (2019), despite differences in our input variables, data pre-processing, MJO index, and ANN complexity. The majority of incorrectly predicted active MJO events at short leads are near the boundary between two RMM phases and predictions are often incorrect by only one phase.
While classification ANN skill is substantially better than linear models at predicting active MJO events, it struggles to predict weak MJO days, with an accuracy at short leads of only ∼50%, which falls to near random chance after ∼1 week ( Figure 6). This is in part due to the training strategy of the classification ANN: by subsampling weak days during training to prevent class imbalance, the classification model learns not to overemphasize the weak phase. This tendency of the classification ANN to underpredict weak MJO events is in contrast to simple 11 of 17 linear models. The MLR model, for example, has a very high accuracy predicting weak MJO events (Figure 6). At early leads this is because the initial RMM phase is given to the model, and at longer leads the MLR model simply categorizes all MJO events as weak.
Assessing the ANN only via accuracy fails to take full advantage of this model's probabilistic forecasts however. This aspect of the classification ANN is distinct from the deterministic output provided by linear models or even some dynamical models, though Marshall et al. (2016) showed how ensemble runs of dynamical models could be used to provide probabilistic MJO forecasts. Assessing the ANN and linear models via the RPSS (Figure 7a), the classification model performance is clearly superior. The ANN skill remains greater than climatology out to ∼16-18 days in winter (comparable to the regression model skill assessed via the BCC), while the deterministic linear models show skill to about one week. This demonstrates that the classification ANN provides probabilistic information that is useful and adds to the model skill past what deterministic schemes can provide.
Model confidence has clear utility for forecasters and could drive future work in probabilistic MJO prediction (Marshall et al., 2016). It further may be useful in improving understanding of MJO predictability. For example, the classification ANNs probabilistic forecasts are reliable-in the sense that ANN confidence corresponds well with model accuracy-which indicates that model confidence is a useful and meaningful output in this work ( Figure 7). Furthermore, ANN confidence relates to physical aspects of the MJO: we found ANN confidence is closely associated with initial MJO amplitude (correlation coefficients of ∼0.5-0.7 depending on lead), with higher confidence associated with higher initial RMM amplitude (Figure 7b). Research using ANN confidence to identify predictable states of the atmosphere has recently shown promise, including in the context of MJO teleconnections to the extra-tropics Mayer & Barnes, 2021).
The tradeoffs between the simple classification and regression ANN architectures we explored here make choosing a "better" model difficult, and in presenting both we illustrate their respective strengths and limitations. The regression model outputs more precise RMM information and is more readily comparable to existing models, but struggles to predict strong MJO amplitudes at long leads or MJO initiation. This is true even when the regression model was re-trained using fewer weak MJO days to emphasize strong MJO events: little change in performance was seen ( Figure S2 in Supporting Information S1). The classification ANN shows the opposite tendency, overestimating the percentage of active MJO days and struggling to accurately predict weak MJO events. And while the classification ANN cannot provide precise information about MJO strength and location it provides a unique probabilistic output compared to other simple statistical models of the MJO.
Overall, results for both ML architectures show that aspects of the MJO are skillfully predicted by several metrics beyond 2 weeks in winter, and the ANNs outperform existing linear statistical models. A range of sensitivity tests (Text S1 and Figures S3-S5 in Supporting Information S1), including increasing the amount of training data using twentieth-century reanalysis and including additional days in the input, showed comparable performance, though tests were not exhaustive nor explored beyond relatively simple ANN architectures. Also note that while our primary goal here is to introduce and establish a baseline for ML modeling of the MJO, the simple

of 17
ANNs we explored are not yet competitive with most S2S dynamical forecast models (e.g., H. Kim et al., 2018;Vitart, 2017). State-of-the-art dynamic model skill predicting the MJO generally falls between 25 and 35 days when assessed via the BCC (H. Kim et al., 2018;Vitart, 2017), and probabilistic MJO forecasts formed by running ensembles of dynamical models show skill via the RPSS out to approximately 25 days in one S2S model (Marshall et al., 2016). It remains to be seen whether future ML research might improve to the point where it is competitive with dynamical models, but as the next section illustrates, even the simple ANNs introduced here can be used as a tool for more than just prediction, and may help spur new discoveries or generate new hypotheses.

Experimentation and Explainability of ANN Models
A limiting aspect of many standard MJO statistical prediction models, including the persistence, VAR, and MLR models presented here, is they rely entirely on an MJO index as input. In contrast, the ANNs we utilize learn relationships between latitude-longitude maps of one or more tropical variables and an MJO index, meaning that the statistical relationships they learn connect the spatial patterns and interrelationships of the input variables to the behavior of the MJO at various lead times. This flexible framework allows for more experimentation across input variables and input processing strategies than existing approaches, allowing us to explore the impact of different variables on MJO prediction skill. In addition, this framework in conjunction with XAI techniques further illuminates what aspects and spatial regions of the input variables are most important for the model's predictions.
We first illustrate this through classification ANN experiments inputting various combinations of one to three different variables, targeting leads 0, 5, and 10 days for brevity. Overall, model accuracy varies widely depending on input (Figure 8). For example, across 1-variable ANNs (Figure 8a) 850 hPa meridional wind and SST models show much poorer performance than other inputs. In the case of the SST model, this suggests the ocean state alone (when processed to highlight subseasonal variability) does not contain MJO signals the ANN is able to leverage, consistent with findings that sub-seasonal SST variability does not drive the MJO (e.g., Newman et al., 2009). In the case of meridional wind, while the MJO possesses signals in meridional wind associated with Rossby wave gyres (Zhang, 2005), we hypothesize that skill may be low because these signals lack the global-scale coherence seen in variables like zonal wind and OLR that are captured by RMM.
The most accurate models at short leads are those that input 850 hPa and/or 200 hPa zonal winds (Figure 8). This is consistent with literature showing that MJO circulation tends to drive the RMM index (Straub, 2013;Ventrice et al., 2013), an aspect of RMM the ANN has organically learned. Interestingly, skill identifying the MJO at short leads does not necessarily imply similar performance predicting the MJO at longer leads. For example, at lead 0 the 850 and 200 hPa zonal wind model has the highest accuracy among 2-variable models (Figure 8b), but at lead 5 and 10 its accuracy overlaps with other combinations of variables. Best performing models at longer leads are 13 of 17 those that include information about zonal wind and the large-scale thermodynamic or moisture signature of the MJO, as measured for example, by OLR or column water vapor (Figure 8c). Further, the three RMM input variables are not always clearly best performing at leads of 5 and 10 days: a model with total column water, 200 hPa zonal wind and 200 hPa temperature performs as well as or slightly better than the model with 200 and 850 hPa zonal wind and OLR (Figure 8c).
Finally, while more input variables tend to improve model performance (Figure 8), tests showed no substantial improvement using four or more inputs ( Figure S5 in Supporting Information S1), at least among the variables considered here. Whether this is due to the limited complexity of our ANNs, the amount of training data, or because new, meaningful information is difficult to leverage with more variables is not known. Additional variables (perhaps with different preprocessing) will continue to be explored, but these initial tests provide a proof-ofconcept for the kind of experimentation that ANNs afford.
Another advantage of ANNs versus other MJO modeling frameworks is the ability to apply XAI tools like LRP (Section 3.1.2), which identifies sources of ANN prediction skill. As a first example, Figure 9 shows wintertime composite LRP maps using the classification ANN from Section 4.1. LRP maps are shown for lead times of 0 and 10 days, composited across correct ANN predictions when the MJO is in phase 5 at the time of verification. Composites are further restricted to those events when model confidence exceeds the 60th percentile (calculated from the full distribution of model confidence for each lead, not the distribution only over correct predictions).
The LRP plots confirm that the classification ANN focuses on regions central to the MJO. At lead 0, OLR relevance highlights suppressed Indian Ocean convection and active conditions around the Maritime Continent (Figures 9a and 9b), whereas wind fields focus on low-level westerly anomalies around the Maritime Continent (Figures 9c and 9d) and upper level signals in the central and east Pacific (Figures 9e and 9f), all of which are hallmark features of a phase 5 MJO. At lead 10, LRP shows how the ANN accounts for eastward MJO propagation: the maximum relevance for OLR is shifted west relative to lead 0, highlighting strong convection in the eastern Indian Ocean (Figures 9g and 9h). The lead-10 model also focuses on a dipole region of strong low-level 14 of 17 winds near the equatorial Maritime Continent, and upper-level easterly anomalies in the western Indian Ocean (Figures 9i-9l).
Combining both experimentation across model inputs and LRP allows examination of sources of predictability across different variables. For example, while the 3-variable model using total column water vapor, and 200 hPa wind and temperature (gray bar in Figure 8) underperforms the OLR and zonal winds models at lead 0, at lead 10 their performance is comparable; Figure 10 shows the LRP maps from that model. At short leads, total column water vapor relevance matches regions of OLR relevance closely (compare Figures 9b and 10b), and the 200 hPa winds also focus on very similar regions. Upper-level temperatures are most relevant around the western Pacific slightly to the east of enhanced convection, where they show warm anomalies consistent with convective heating in the upper troposphere. In contrast, at 10 day leads the column water vapor shows a clearer difference in relevance compared to the OLR: water vapor signals south of the equator and Maritime Continent, as well as the signals around northern Australia show maxima in relevance. The focus in particular on southern hemisphere moisture signals may be due to the tendency of the winter-time MJO to detour south of the Maritime Continent (D. Kim et al., 2017). Upper-level temperature signals at lead 10 show highest relevance over the Maritime Continent, and focus mainly on near-equatorial warm anomalies in that region. It is noteworthy that while the composite (Figure 10i) shows equally strong temperature signals on the equator and in the subtropics to the west, the LRP map (Figure 10j) indicates the model focuses on the strong equatorial signals.
LRP thus provides information about how the ANN identifies the MJO and what signals across variables are most associated with future MJO behavior. The unique information LRP outputs may be useful to continue to explore sources of MJO prediction skill in simple ANNs, for example, under different large-scale states or for case studies.

Discussion and Conclusions
Motivated by the ability of ML methods to skillfully predict other climate and weather phenomena, as well as a lack of recent progress in statistical MJO modeling, here, we demonstrate how simple ML frameworks can be used to predict the MJO. We establish two straightforward neural network architectures (a regression and Figure 10. Layer-wise relevance propagation example. As in Figure 9, but for the artificial neural network inputting a different set of variables: total column water vapor, 200 hPa temperature (t200), and 200 hPa zonal wind.
MARTIN ET AL.

10.1029/2021MS002774
15 of 17 classification approach) that use shallow ANNs to predict an MJO index. The regression ANN shows prediction skill out to ∼18 days in winter and ∼11 days in summer, which is high skill for a statistical approach. The classification ANN shows probabilistic skill better than climatology out to similar leads of ∼16 days in winter. Both ANN architectures perform better than traditional statistical models and set benchmarks for continued ML modeling of the MJO. ANN prediction skill is not comparable to dynamical models, though continued work may improve ML prediction skill of the MJO, perhaps via other ML modeling frameworks, more advanced input processing, or leveraging larger data sets from climate model simulations. We also emphasize in this work that simple ANNs are efficiently able to reproduce aspects of MJO predictability found in more complex, computationally-expensive dynamical models, such as sensitivity to MJO initial amplitude and phase of the stratospheric QBO. This makes these models affordable tools to continue to study the MJO and MJO predictability. XAI tools can also help illuminate sources and regions of ANN model skill.
This work illustrates how simple ANNs can be used not only for prediction, but also as tools for hypothesis testing and experimentation that might drive new discoveries or scientific insights. While our focus here is on the MJO, the framework we establish is widely applicable to a range of different climate phenomena, especially oscillations that can be represented as simple indices. The performance, affordability, accessibility, and explainability of simple ANNs thus recommends their continued adoption by the climate community.