Vertical structure and physical processes of the Madden-Julian oscillation: Linking hindcast fidelity to simulated diabatic heating and moistening

Many theories for the Madden-Julian oscillation (MJO) focus on diabatic processes, particularly the evolution of vertical heating and moistening. Poor MJO performance in weather and climate models is often blamed on biases in these processes and their interactions with the large-scale circulation. We introduce one of the three components of a model evaluation project, which aims to connect MJO ﬁdelity in models to their representations of several physical processes, focusing on diabatic heating and moistening. This component consists of 20 day hindcasts, initialized daily during two MJO events in winter 2009–2010. The 13 models exhibit a range of skill: several have accurate forecasts to 20 days lead, while others perform similarly to statistical models (8–11 days). Models that maintain the observed MJO amplitude accurately predict propagation, but not vice versa. We ﬁnd no link between hindcast ﬁdelity and the precipitation-moisture relationship, in contrast to other recent studies. There is also no relationship between models’ performance and the evolution of their diabatic heating proﬁles with rain rate. A more robust association emerges between models’ ﬁdelity and net moistening: the highest-skill models show a clear transition from low-level moistening for light rainfall to midlevel moistening at moderate rainfall and upper level moistening for heavy rainfall. The midlevel moistening, arising from both dynamics and physics, may be most important. Accurately representing many processes may be necessary but not suﬃcient for capturing the MJO, which suggests that models fail to predict the MJO for a broad range of reasons and limits the possibility of ﬁnding a panacea.


Introduction
The Madden-Julian oscillation (MJO) [Madden and Julian, 1972] is a key driver of tropical and extratropical circulation and precipitation variability on subseasonal scales. The MJO comprises quasiperiodic fluctuations between active and suppressed convective states in the tropics. MJO active events typically form in the equatorial Indian Ocean, grow to become large-scale envelopes of convective activity that often span several thousand kilometers zonally, then propagate east through the Maritime Continent to the West Pacific with a phase speed of approximately 5 m s −1 [e.g., Madden and Julian, 1994;Zhang, 2005]. Suppressed conditions (e.g., strong surface insolation, light winds, warming sea surface temperatures (SSTs)) precede and follow General circulation models (GCMs) used for numerical weather prediction (NWP) and climate simulations frequently suffer from weak subseasonal variability in tropical convection, poor spatial organization, and deficient eastward propagation [e.g., Slingo et al., 1996;Lin et al., 2006;Sperber and Annalamai, 2008;Kim et al., 2009;Hung et al., 2013]. Although MJO prediction skill in NWP has improved in some GCMs recently [e.g., Waliser, 2012;Vitart, 2014], the prediction skill of most GCMs lags predictability by at least 10 days [Neena et al., 2014]. These biases have come under increasing scrutiny recently, as modeling centers focus on the performance of their subseasonal (2-6 week) prediction systems, which fill a critical gap between medium-range and seasonal forecasts [Gottschalck et al., 2010]. Attempts to improve GCM representations of the MJO have centered on the subgrid-scale parameterization of convection, including adding or increasing the sensitivity of a trigger based on low-level moisture convergence [e.g., Wang and Schlesinger, 1999] or increasing the sensitivity of a convective parcel to environmental humidity [e.g., Bechtold et al., 2008;Chikira and Sugiyama, 2010;Subramanian et al., 2011;Klingaman and Woolnough, 2014]; producing more upper level heating from stratiform precipitation to increase the covariance between heating and temperature anomalies [Fu and Wang, 2009;Seo and Wang, 2010], based on similar instabilities found for convectively coupled waves [e.g., Mapes, 2000]; and introducing momentum mixing by convection [e.g., Zhou et al., 2012]. At a basic level, many of these studies emphasize the importance of promoting instabilities between certain components of the GCM diabatic heating profiles and the large-scale circulation, which are often damped in models with a weak MJO. However, the targeted components of the heating profiles vary among the studies. Further, the improvements found often depend on the GCM, or even on the particular GCM version, limiting their applicability and confounding parameterization development efforts.
Advances in GCM parameterization development are also complicated by a wide range of hypotheses, based on observations or on theoretical studies with reduced-complexity models, concerning which component of the MJO-associated diabatic heating profile is most critical to promoting and maintaining convective instability. These hypotheses have targeted free-tropospheric moisture convergence from low-level heating [e.g., Lau and Peng, 1987], frictional moisture convergence in the boundary layer [e.g., Wang and Rui, 1990], local cycles of moisture discharge and recharge [e.g., Bladé and Hartmann, 1993], and cloud-radiation interactions and the radiative heating [e.g., Slingo and Madden, 1991;Hu and Randall, 1994;Raymond, 2001;Stephens et al., 2004]. There are considerable disagreements among observations and reanalysis products in the shape and amplitude of MJO-associated heating profiles, even among products processed from the same Tropical Rainfall Measuring Mission instruments [Ling and Zhang, 2011;Jiang et al., 2011], which likely result not only from discrepancies among the data sets but also from sensitivities to MJO compositing method. While reanalyses typically show a heating structure that tilts westward with height-producing low-level heating and moisture convergence east of the strongest convection [e.g., Lappen and Schumacher, 2012]-this tilt is missing in some observed products. Recent modeling work [Lappen and Schumacher, 2014] suggests that this low-level heating from shallow convection may be more critical to the representation of the MJO than the "top-heavy" heating, caused by strong stratiform components, that past studies emphasized.
Over the past few years, observational and modeling studies have focused attention on the role of the MJO-suppressed phase in maintaining the oscillation. Using observations and reanalysis, Kim et al. [2014a] found that Indian Ocean active phases propagated east more frequently, and more coherently, when the suppressed phase in the West Pacific was stronger, causing enhanced meridional moisture advection east of the active phase. Woolnough et al. [2010] analyzed cloud-resolving model (CRM) and single-column model (SCM) simulations of suppressed conditions during the Tropical Ocean-Global Atmosphere (TOGA) Coupled Ocean-Atmosphere Response Experiment (TOGA COARE) [Webster and Yang, 1992]. SCMs with less precipitation during the suppressed phase maintained a moister atmosphere and produced a more gradual transition from shallow to deep convection, in closer agreement with the CRMs; SCMs with more precipitation in the suppressed phase showed a sharper transition to deep convection that was delayed relative to the CRMs and 10.1002/2014JD022374 the SCMs with less precipitation. The SCMs with more precipitation in the suppressed phase also produced less precipitation in the active phase, implying weaker intraseasonal variability. Most GCMs also produce a too sharp transition to active conditions, which studies have suggested is caused by a poor relationship between convection and environmental moisture, particularly in the midtroposphere where GCMs often exhibit a mean dry bias, perhaps from too little convective moistening [e.g., Thayer-Calder and Randall, 2009;Xavier, 2012].
In this study, we link the fidelity of the simulated MJO in 13 GCMs (section 2.2) to their simulated diabatic heating and moistening associated with tropical convection. These 13 GCMs have participated in a novel model evaluation project (section 2.1), providing high-frequency output of heating and moistening tendencies from resolved and subgrid-scale processes (section 2.4). This study focuses on 20 day hindcast simulations of two strong MJO events (section 2.2.1) to analyze how the accuracy of models' MJO predictions relates to the vertical structure of their diabatic processes. This model evaluation project is the first to collect such detailed and high-frequency output from many GCMs, providing a unique opportunity to compare GCMs at the level of individual physical processes.
After describing the hindcast component of the evaluation project in section 2, we assess the fidelity of MJO predictions in these GCMs for these two events in section 3.1; section 3.2 examines precipitation-moisture and precipitation-vertical velocity sensitivities in these GCMs, which have been hypothesized to be important for representing the MJO [e.g., Thayer-Calder and Randall, 2009]; we analyze the heating and moistening profiles as a function of rain rate in sections 3.4 and 3.5, respectively. We discuss our findings and experiment design in section 4 before presenting our key conclusions in section 5.

Experiment Design and Data Sets
We analyze one of the three components of the "Vertical structure and diabatic processes of the MJO" global model evaluation project. The project is organized and supported by the Global Atmospheric Systems Studies (GASS) panel of the Global Energy and Water Exchanges (GEWEX) and the World Climate

Project Design
The project aims to characterize the representation of the diabatic heating and moistening processes associated with the MJO in GCMs, including assessing those processes against analyses and observations [Petch et al., 2011]. The project also aims to link GCM representations of these processes to the quality of MJO simulation. Detailed analysis of diabatic processes requires a bespoke experimental design, since more general intercomparisons (e.g., the Coupled Model Intercomparison Project) do not archive subdaily parameterization tendencies.
The project comprises three-component experiments that take advantage of known links between biases in short-range weather forecasts and long-term climate simulations [e.g., Boyle et al., 2008;Xie et al., 2012;Ma et al., 2013Ma et al., , 2014. The three experiments are (1) 20 year simulations, using either atmosphere-only GCMs forced by observed SSTs or atmosphere-ocean coupled GCMs; (2) a series of 2 day hindcasts, initialized once per day, for two strong MJO events (section 2.2) during the Year of Tropical Convection (YOTC) ; and (3) as in (2) but for 20 day hindcasts. The goals of (1) are to determine the ability of each GCM to represent the MJO when the GCM is near its preferred climate, as well as its ability to generate MJO events [Jiang et al., 2015]. Component (2), which acquired time step frequency GCM output, enables a detailed investigation of the behavior of individual physics schemes when the GCM is in a near-observed state that contains an active MJO . Component (3) is the focus of this manuscript and seeks to link (1) and (2). By using 20 day hindcasts with a broader range of initial dates than (2), but acquiring three-hourly output (section 2.4), this component was designed to connect degradations in MJO performance with lead time to biases in simulated diabatic processes. Klingaman et al. [2015] synthesize and summarize the conclusions of the entire project.

Experiment Design 2.2.1. Main Set of Dynamical Models
The GCMs participating in the 20 day hindcast component are listed in Table 1, along with the modeling center that submitted the simulations, the native resolutions, and a reference that provides further details on GCM formulation and parameterizations.  Figure 1a shows "observed" [Wheeler and Hendon, 2004] real-time multivariate MJO (RMM) indices for these periods; see section 2.5 for the computation method. Tropical Rainfall Measuring Mission (TRMM) rainfall (section 2.6) for these events is shown in Figures 1b and 1c, with the range of initial and final dates for the hindcasts indicated. The initial dates were chosen so that the genesis and lysis of the events were captured at least at 10 days lead time. All hindcasts were initialized from 00Z European Centre for Medium-Range Weather Forecasts (ECMWF) analyses produced for YOTC at 16 km horizontal resolution (ECMWF-YOTC) [Moncrieff et al., 2012] [Cavanaugh et al., 2014]. We use the LIM forecasts to provide a more skillful baseline for the dynamical GCMs than simple persistence. In a LIM, anomalies propagate and decay according to a linear operator capable only of exponential (decaying sinusoidal) behavior. In the Miami LIM, this operator was built from 0 day and 2 day lagged covariance matrices of daily time-longitude sections of high-pass anomalies in 1979-2005, using all seasons. Input data were daily 15 • S-15 • N averages of National Centers for Environmental Prediction-National Center for Atmospheric Research (NCEP-NCAR) reanalysis 850 hPa and 200 hPa zonal wind, plus National Oceanic and Atmospheric Administration (NOAA) outgoing longwave radiation (OLR), all in 10 • longitude bins. High-pass anomalies were defined by subtracting the composite seasonal cycle and the mean of the previous 120 days, as in Wheeler and Hendon [2004], to eliminate false skill from seasonally locked or low-frequency variability.
The Scripps LIM differs from the Miami LIM chiefly in that (a) it also uses meridional winds at 850 hPa and 200 hPa, in addition to OLR and zonal winds; (b) it uses grid point fields, rather than latitude averages; and (c) it was trained on 7 day smoothed fields, rather than daily data. Further details can be found in Cavanaugh et al. [2014]. The LIM used here is one of the best performing LIMs from that study: it uses 24 empirical orthogonal functions (EOFs) of wind fields, four EOFs of OLR, and a training lead time of 8 days. We use both the Miami and Scripps LIMs here because they differ in their skill over the short hindcast period in this study (section 3.1) and so provide a rudimentary estimate of the uncertainty in statistical skill.

Further Model Details
Here we provide further details on GCM configurations beyond those in Table 1, as well as differences in GCM configurations from the references provided.

10.1002/2014JD022374
CanCM4 is the only atmosphere-ocean coupled model that produced 20 day hindcasts. CanCM4 uses the Canadian Atmospheric Model (CanAM4) [von Salzen et al., 2013] and the Canadian Ocean Model (CanOM4) [Merryfield et al., 2013]. Rather than initializing from ECMWF-YOTC analyses, to maintain atmosphere-ocean coupled balance, CanCM4 hindcasts are initialized from analyses produced by assimilating six-hourly ECMWF-Interim reanalysis (ERA-Interim) atmospheric temperature, horizontal winds, and specific humidity into CanAM4, with a spectral cutoff at wave number 21 (T21). During the assimilation, daily mean CanAM4 surface forcing is applied to CanOM4; additionally, the CanOM4 SSTs are relaxed to the NCEP optimally interpolated SST analysis with a 3 day time constant. The 20 day hindcasts are initialized from this coupled assimilation system and then run freely (i.e., with no relaxation or assimilation). The same procedure is used operationally at the Canadian Centre for Climate Modelling and Analysis but assimilating operational analyses rather than ERA-Interim. We discuss the implications of the discrepancy in the initial conditions in section 4.
CAM5 was configured with a finite volume dynamical core. CAM5-ZM differs from CAM5 in the addition of the Song and Zhang [2011] convective microphysics to the Morrison and Gettelman [2008] stratiform microphysics scheme from CAM5. CNRM-AM is the atmospheric component of CNRM-CM5 used in CMIP5 simulations and described in Voldoire et al. [2013]. The convection scheme in GISS-E2 was modified to improve tropical intraseasonal variability Kim et al., 2012]. The NavGEM1 configuration used here, for which there is no published reference, differs from the configuration in Hogan et al. [2014] in that it lacks prognostic cloud water and uses the radiation scheme of Harshvardhan et al. [1987].
The choice of SST boundary condition was left to each modeling center. CanCM4 used its predicted SST; SPCAM3, MRI-AGCM3, NavGEM1, and CNRM-AM persisted the initial SST; CAM5, NICAM, Integrated Forecast System (IFS), ECEarth3, MetUM-GA3, and GISS-E2 persisted the initial SST anomaly with respect to a time-varying climatology; and CAM5-ZM, MIROC5, and Goddard Earth Observing System version 5 (GEOS5) used time-varying daily observed SSTs. There is no correlation between the choice of SST boundary condition and model hindcast skill (section 3.1).

Model Output
All centers provided three-hourly output of a standard set of prognostic and diagnostic fields (e.g., temperature, winds, geopotential height, and surface fluxes). Most importantly for this project, all centers also provided three-hourly Eulerian tendencies for temperature, specific humidity, and zonal and meridional wind from the model dynamics and from the convection, longwave and shortwave radiation, boundary layer, and large-scale cloud and precipitation parameterizations, as applicable. A full list of the output fields requested is at http://climate.ncas.ac.uk/pmwiki/MJO_Diabatic_Hindcast. Prior to submission, modeling centers interpolated all data onto a 2.5 • × 2.5 • horizontal grid and 24 pressure levels (every 25 hPa from 1000 hPa to 900 hPa then every 50 hPa to 50 hPa), including the subgrid tendencies.

Computing RMM Indices
Daily RMM indices are computed from GCM data following the procedure in Gottschalck et al. [2010]. From daily mean GCM data (computed from three-hourly output), we remove the mean and first three harmonics of the annual cycle of NOAA OLR and ERA-Interim reanalysis zonal wind . Next, we latitude average each field between 15 • N and 15 • S. We perform the same two steps on NOAA OLR and ECMWF-YOTC winds for the hindcast period. For each hindcast lead day t (where t = 1 is the first day of data), we remove a mean of the previous 120 days of latitude-averaged data from each field. The mean is composed of 121−t days of "observations" (NOAA OLR and ECMWF-YOTC winds) and t days of hindcast data. We use NOAA OLR and ECMWF-YOTC winds, instead of the model analysis data suggested in Gottschalck et al. [2010], because many of the GCMs that submitted data do not have an analysis system. We then project the resulting model anomalies onto the Wheeler and Hendon [2004] RMM EOFs to compute the RMM indices; the Wheeler and Hendon [2004] EOFs are computed from 1979 to 2001 NOAA OLR and NCEP-NCAR reanalysis winds.
We also compute observed RMM indices by replacing the model data above with daily mean NOAA OLR and ECMWF-YOTC winds and repeating the procedure. We verify the GCM RMM indices against these observed values. We verify the LIM RMM indices against the LIMs' own lagged initial (day 0) values, because the LIMs were built using different OLR and zonal wind data sets than the ones employed here. This means that the LIM day 0 error is zero, which also improves the clarity of our skill figures (section 3.1).

Other Data Sets
Model rainfall is compared against three-hourly TRMM product 3B42 version 7 [Kummerow et al., 1998], which is a combination of many passive microwave and infrared sensors, calibrated against and merged with gauge  Table 1). The codes appear next to the final point of each line for a qualitative ranking.
data. Importantly for our analysis, TRMM 3B42 has been shown to underdetect very light rainfall over the tropical oceans, due to biases in one of the microwave sensors, but performs reliably for moderate and heavy rain rates [Huffman et al., 2007]. In section 3.2, we compare GCM precipitation-moisture relationships to an observed product composed of TRMM rainfall and specific humidity from the ECMWF-YOTC 0-23 h forecasts (ECMWF-YOTC/TRMM). We also compare heating and moistening tendencies from the hindcast GCMs to those from the ECMWF-YOTC 0-23 h forecasts. We stress that this biases the GCM intercomparison in favor of the IFS 20 day hindcasts, but also note that we find considerable differences between ECMWF-YOTC and the IFS 10.1002/2014JD022374 hindcasts in many diagnostics. Further, we note that the IFS 20 day hindcasts were performed with a later IFS version (37r3) than ECMWF-YOTC (35r3).

Results
In this manuscript, it is not often possible to include results for each type of analysis from all 13 GCMs. Instead, we select models from across the continuum of model skill, as discussed at the end of section 3.1. For all analyses presented, results for all GCMs are publicly available on the project website: http://climate.ncas.ac. uk/pmwiki/MJO_Diabatic_Hindcast.

MJO Predictions for Cases E and F
Bivariate correlations and root-mean-square error (RMSE) [Gottschalck et al., 2010] between hindcast and observed RMM indices were computed with hindcast lead time, using all start dates ( Figure 2). RMSE values are not shown; error growth is roughly equal and opposite to the decline in the bivariate correlations. The models display a wide variety of "skill" for these events: CAM5 and CAM5-ZM (C5 and CZ in Figure 2) exhibit more than 20 days skill (using a threshold correlation of 0.7, used for all skill measures); many other GCMs reach 16-18 days skill; several GCMs and the two LIMs (LS and LM) have 8-12 days skill ( Figure 2a). One model, CanCM4 (CC), has skill similar to a persistence forecast (PE; 7 days) and lower than the two LIMs. Performance for the two cases individually (not shown) strongly resembles that for the two combined. We use the term skill very loosely here, since we have a set of only 94 hindcasts over two events. This experiment does not aim to provide a thorough assessment of GCM prediction skill, but to link GCM fidelity in predicting these MJO events to representations of diabatic processes (section 2.2). For a robust assessment of the hindcast skill of contemporary GCMs, see the Intraseasonal Variability Hindcast Experiment [e.g., Neena et al., 2014], which obtained 20 years of hindcasts for many models. Neena et al. [2014] found that most prediction systems had useful skill for the MJO to 10-20 days for individual ensemble members-in line with our skill estimates for the YoTC cases-and 15-25 days for the ensemble mean.
The set of GCM hindcasts contains two sensitivity tests: air-sea coupling, using the 15 day MetUM atmosphere-only (MA) and coupled (MC) configurations, and convective microphysics, using CAM5 and CAM5-ZM. Coupling and convective microphysics offer slight improvements in the bivariate correlation for these two cases. For air-sea coupling, this confirms the analysis of Shelly et al. [2014], who found small benefits to skill during YOTC.
To determine how the degradation in RMM predictions with lead time for a given GCM is associated with the three RMM components, bivariate correlations and RMSEs were computed for the contributions of OLR, U850, and U200 to the indices. For many GCMs (e.g., NavGEM1 (NR) and ECEarth3 (E3)), correlations decline more quickly, and RMSEs grow more quickly (not shown), for OLR ( Figure 2b) than for U850 ( Figure 2c) or U200 ( Figure 2d). We note, however, that OLR contributes less to the RMM indices than U200 and U850 [e.g., Straub, 2013;Ling et al., 2014]. Several GCMs, such as GEOS5 (NA) and GISS-E2 (GI), have low initial OLR correlations due to the spin-up of convection from the ECMWF-YOTC analyses. For GISS-E2, CanCM4 (CC), MRI-AGCM3 (MR), and SPCAM3 (SP), however, the correlations decline most quickly for U200. This suggests that upper level circulation biases are the limiting factors in predictions for these cases in these GCMs, which may indicate issues with either the vertical mixing of momentum by convection or with the depth of convection. SPCAM3 does not include vertical momentum mixing, producing an overly strong vertical shear of zonal wind (not shown). This behavior is consistent with a rapid drift toward the model's climatology, which exhibits excessive superrotation [Pritchard et al., 2014]. The other three GCMs include parameterized convective momentum transport (CMT), as do all GCMs except SPCAM3, but biases in CMT and upper level circulation can reduce simulated MJO activity [e.g., Ling et al., 2009;Klingaman and Woolnough, 2014]. Because U200 represents the largest contribution to the RMM indices of the three components [Wheeler and Hendon, 2004;Straub, 2013], using the RMM indices over an OLR-only index [e.g., Sperber and Kim, 2012] penalizes models with poor upper level wind variability. We compare RMM skill to a diagnostic of precipitation skill at the end of this section.
Correlations for CAM5-ZM slightly exceed those for CAM5 in all components, suggesting modest improvements to MJO predictions from convective microphysics. Air-sea coupling in MetUM improves predictions of the OLR components by 2 days (12 days to 14 days; Figure 2b) while slightly improving U850 and worsening U200. Shelly et al. [2014] also found that coupling improved OLR predictions, primarily over the Maritime Continent.
We note, without explanation, that the Miami LIM (LM) produces more skillful predictions than the Scripps LIM (LS). Our hindcast period is too short to rigorously compare these LIMs; we use them only as baseline measures of skill for the GCMs. The LIMs lie at the lower end of the envelope of GCMs, exhibiting similar skill to the least skillful dynamical models (e.g., CanCM4 and NavGEM1; Figure 2a). This represents an improvement in the skill of dynamical models over the past few years, given that in previous assessments dynamical models were shown to have similar or only slightly higher skill compared to statistical models [e.g., Seo et al., 2009;Seo, 2009;Kang and Kim, 2010], although some studies have found that dynamical models could outperform simpler statistical models [e.g., Rashid et al., 2011].
Most models damp RMM amplitude with lead time, either slightly or strongly (Figure 2e). This is true even for GCMs with 16-18 days skill (Figure 2a), such as GEOS5 and ECEarth3. Many models show greater and faster declines in RMM1 than RMM2 magnitude (not shown), suggesting issues predicting the MJO over the Maritime Continent. The four GCMs that maintain the observed RMM amplitude throughout the forecast also show high RMM bivariate correlations: CAM5, CAM5-ZM, IFS, and MRI-AGCM3. SPCAM3 and CNRM-AM (CN) amplify the MJO signal with lead time, but neither has skill beyond 15 days ( Figure 2a).
As a measure of MJO propagation, we compute lag correlations of RMM1 and RMM2 and find the lag of the maximum correlation magnitude, similar to Sperber and , which used EOFs of intraseasonal OLR only. The observed RMM indices have a maximum lag correlation at 7 days (Figure 2f ), which is close to the 8-9 day lag of the maximum correlation in Wheeler and Hendon [2004]. For each GCM, we calculate lag correlations between RMM1 and RMM2 using all start dates and 5 day ranges of lead times (e.g., days 1-5 and days 6-10) to reduce high-frequency noise in the metric due to the limited sample of hindcasts. All GCMs show longer-than-observed lags between RMM1 and RMM2, indicative of slow MJO propagation (Figure 2f ), although in most GCMs the propagation speed increases with lead time as amplitude decays. There is little correspondence between RMM amplitude and this propagation metric: CanCM4 and GEOS5 show weak amplitude, but CanCM4 produces near-observed propagation, while GEOS5 is too slow; CNRM-AM and SPCAM3 produce overly strong amplitudes, but CNRM-AM produces near-observed propagation, while SPCAM3 is slower than observed.
For the remainder of our analysis, we compare processes among these GCMs, focusing on diabatic heating (section 3.4) and moistening (section 3.5). Our objective is to compare GCM representations of these processes to skill in predicting these cases, which requires ranking the models by some skill measure. We rank the GCMs by the lead time at which the RMM bivariate correlation first falls below the 0.7 threshold. Table 2 shows this ranking, along with a tercile-based system for classifying the models into groups of relatively "higher, " "moderate, " and "lower" MJO fidelity. These classifications are used for only two purposes: to select example GCMs from each group to display in subsequent figures and to color the GCM identification codes on several figures to provide a general indication of fidelity. When we correlate MJO fidelity with process-oriented diagnostics, we always use the values of skill shown in Table 2, not the relative classifications.
There are two issues with our approach: (a) the limited hindcast set may not represent overall GCM performance and (b) the conclusions may vary based on the skill measure chosen. For (a), we emphasize that our ranking of model performance is based solely on these two cases; it may not reflect the typical performance of these models for predicting the MJO. We discuss this further in section 4. To address (b), we present an alternate quantification of MJO fidelity, based on the method adopted by Jiang et al. [2015] for assessing the 20 year climate simulations from this project. Jiang et al. [2015] computed pattern correlations between simulated and observed rainfall Hövmoller diagrams over an equatorial band (averaged 5 • S-5 • N), using 20-100 day band pass-filtered rainfall and regressing the rainfall against base regions in the Indian Ocean and West Pacific.
Here we form Hövmoller diagrams over the same region, but using unfiltered GCM and TRMM 3 h rainfall and without using regressions. For each of the two hindcast cases, we form Hövmollers from each GCM at fixed lead times and compute the pattern correlation with the TRMM Hövmoller diagram from the same period; we then average the pattern correlations from the two cases. Figure 3 demonstrates that models with higher MJO fidelity by the RMM bivariate correlation measure (green model codes) tend to produce higher skill in equatorially averaged rainfall; models with lower bivariate RMM correlations (red codes) tend to produce lower rainfall skill. There are several exceptions: GISS-E2 (CNRM-AM) performs better (worse) by the rainfall measure than by the RMM correlation measure. As above, the RMM indices are dominated by U200 and U850, so it is not surprising that GCMs that exhibit better performance for the OLR component than for the wind components (e.g., GISS-E2) also demonstrate better precipitation fidelity relative to their RMM skill scores. Still, the fact that the two measures of fidelity agree reasonably well increases our confidence that the bivariate RMM correlation reliably represents overall MJO fidelity in these hindcasts. Figure 3 also shows that all models have considerably lower skill in predicting latitude-averaged precipitation than in predicting the RMM indices, including for the OLR component of the RMM indices ( Figure 2b). We note, however, that our precipitation Hövmoller diagrams were not filtered to isolate the MJO signal, unlike those in Jiang et al. [2015]; they contain many smaller-scale features in space and time that may be unrelated to the MJO and that models would be unlikely to capture at lead times beyond a few days, if at all. Again, we emphasize that this experiment is poorly suited to a robust and precise quantification of MJO performance in these models. Figure 3. For each GCM, the pattern correlation of Hövmoller diagrams of latitude-averaged (10 • S-10 • N), daily mean rainfall, constructed across 60 • E-180 • , between each GCM and TRMM observations. For each of the two hindcast cases, GCM Hövmoller diagrams are constructed at the fixed lead times shown in the horizontal axis; the value shown is the mean pattern correlation over the two cases, using all start dates. Models are identified by two-letter codes (Table 1), which are colored by relative fidelity (Table 2) and placed next to the final point of each line to give a qualitative ranking.   (Figures 4a and 4b).
The moderate-fidelity models, such as MRI-AGCM3 and CNRM-AM (Figures 4c and 4d), show larger errors in RMM amplitude, particularly as lead time increases. MIROC5 and CanCM4 are examples of lower fidelity models, which show still larger errors in amplitude and phase (Figures 4e and 4f ). In particular, CanCM4 is strongly attracted to the unit circle in RMM phase space. We use these as examples of high-, moderate-, and low-fidelity models throughout the manuscript, adding other models that show noteworthy or distinct behavior as appropriate.

Relationships Between Rainfall and Specific Humidity and Vertical Velocity
Recent studies have pointed to the importance of the sensitivity of parameterized convection to environmental moisture, particularly in the midtroposphere, for GCM representations of the MJO [e.g., Thayer-Calder and Randall, 2009;Kim et al., 2012;Xavier, 2012]. Low sensitivity causes excessive rainfall, relative to observations, in relatively dry columns, reducing instability and weakening the MJO-suppressed phase; it also causes a peaked rain rate distribution at values close to those required to maintain radiative-convective equilibrium (6-16 mm d −1 ), which suppresses subseasonal variability.
To examine if the rainfall-moisture relationship is related to MJO fidelity, grid point daily mean specific humidity anomalies are composited on grid point daily mean rainfall in the warm pool (10 • S-10 • N, 60 • E-180 • ). . For (a) ECMWF-YOTC specific humidity and TRMM precipitation and (b-h) specific humidity and precipitation from each GCM, shaded rectangles show mean anomalies in grid point specific humidity, relative to the zonal mean, across the range of grid point precipitation rates (mm d −1 ) on the horizontal axis. The dashed line shows the probability distribution of precipitation rates, using the right-hand vertical axis. All panels use daily mean data for all hindcast start dates (Figures 5b-5h) and days 3-20; statistics are accumulated over 10 • S-10 • N and 60 • E-180 • ; zonal means are computed using all longitudes.
Specific humidity anomalies are computed from the zonal mean, calculated daily from all longitudes (0 • -360 • ). When a lead time-dependent climatology is not available, as in this study, using anomalies from a time-varying zonal mean limits the effects of drift in GCM humidity away from the initial analysis [Klingaman and Woolnough, 2014]. Statistics are accumulated over all start dates and lead times. Due to the use of a time-varying zonal mean, these statistics vary little with lead time in any GCM (not shown), indicating that they are intrinsic properties of the GCM and do not depend on the presence of an active MJO (i.e., several GCMs have little MJO activity after 10-15 days). The mean specific humidity anomaly is computed for each rain rate range ( Figure 5); the ranges are qualitatively chosen so that TRMM is approximately evenly distributed. We also computed the mean specific humidity anomaly for ranges of rain rate percentiles in the GCMs and TRMM-ensuring a uniform rain rate distribution-and reached similar conclusions.
GCMs are compared to ECMWF-YOTC specific humidity composited on TRMM (ECMWF-YOTC/TRMM; Figure 5a); using ECMWF-YOTC, 0-23 h forecast rainfall produced similar results. ECMWF-YOTC/TRMM shows that rain rates as light as 0.8 mm d −1 are associated with positive low-level humidity anomalies, with the entire column anomalously moist by 4 mm d −1 . Between these rates is a gradual transition from low-level to whole-column positive anomalies. NICAM, which produces accurate forecasts, simulates a rainfall-moisture  (Table 2). Least squares regression lines and correlation coefficients are also shown. We do not assess MJO fidelity in NICAM, so we place its symbol on the horizontal axis and exclude it from the correlation and regression analysis. relationship (Figure 5b) that resembles ECMWF-YOTC/TRMM, but light rain occurs in columns that are too anomalously moist, and the model has a much more gradual transition from midlevel to whole-column positive anomalies than ECMWF-YOTC, in which a sharp transition occurs around 5 mm d −1 . Nearly all nonzero rain rates are associated with positive low-level humidity anomalies, which is balanced by a higher frequency of <0.1 mm d −1 rain rates than in TRMM.
IFS also produces a rainfall-moisture relationship similar to ECMWF-YOTC/TRMM, which is perhaps not surprising given that the ECMWF-YOTC analyses were produced with a closely related version of the IFS (Figure 5d and section 2.6). In particular, IFS shows a sharp transition from midlevel to whole-column positive moisture anomalies around 5 mm d −1 . There are substantial differences between ECMWF-YOTC/TRMM and IFS, however, which maximize 1-2 days into the IFS hindcasts (not shown), including much stronger dry anomalies at lower rain rates-indicating dry columns are associated with more rainfall-and a shift to maximum humidity anomalies above the freezing level (near 550 hPa) for rain rates >1.0 mm d −1 .
Despite its high fidelity, CAM5-ZM displays a poor rainfall-moisture relationship and rain rate distribution relative to ECMWF-YOTC/TRMM (Figure 5c). Rain rates up to 2.5 mm d −1 are associated with wholecolumn dry anomalies; there is a sharp transition from whole-column dry anomalies to whole-column moist anomalies, and the rain rate distribution peaks near 10 mm d −1 . The lack of near-zero rain rates suggests that CAM5-ZM produces at least some precipitation nearly everywhere in the tropics at all times. CAM5 displays very similar results (not shown). GEOS5, another higher-fidelity model, shows many of the same features (Figure 5e), although the rain rate distribution is somewhat closer to TRMM. GEOS5 also has a "tongue" of dry anomalies near the top of the boundary layer (900 hPa) that extends out to 3-5 mm d −1 , whereas ECMWF-YOTC and other models show "tongues" of moist anomalies at this height and these rain rates. Broadly similar rainfall-moisture relationships to the high-fidelity models can be seen in the moderate-fidelity models, including MRI-AGCM3 and CNRM-AM (Figures 5f and 5g), as well as in the lower fidelity models, including MIROC5 and CanCM4 (Figures 5h and 5i). With the exception of NICAM, there appears to be little variation in the rainfall-humidity relationships in this set of GCMs.
To quantify the relationship between this rainfall-humidity diagnostic and MJO hindcast fidelity, we compute the pattern correlation of the rainfall-humidity relationship shown in Figure 5 for each GCM against that in ECMWF-YOTC/TRMM. There is no useful correlation (r = 0.20, p > 0.20) between these pattern correlations and MJO fidelity (Figure 6a). The similarity in the pattern correlation values confirms the small inter-GCM variations in the rainfall-humidity relationship discussed above. We obtain similar results for the relationship between rainfall and vertical velocity, for which we present only the pattern correlation values (Figure 6b). There is more inter-GCM spread in this diagnostic than in the rainfall-humidity diagnostic, but there is still no meaningful correlation with MJO fidelity (r = 0.29, p > 0.20). We note that Jiang et al. [2015] found that the difference in 850-500 hPa integrated relative humidity between the bottom 10% and top 5% of rainfall events [Kim et al., 2014b;Maloney et al., 2014] was able to better discriminate between high-and low-fidelity GCMs in the 20 year climate simulations from this project; we discuss this further in section 4.

Method for Producing Composite Heating and Moistening Profiles
We choose to analyze temperature and specific humidity tendencies as a function of rain rate, by compositing tendencies by quartiles of GCM rain rates over a large horizontal domain: 60 • E-180 • and 10 • S-10 • N. This method differs from that used for the companion 2 day hindcasts in Xavier et al. [2015], in which the authors chose a smaller (5 • × 5 • ) region and composited tendencies by "suppressed, " "transition, " and "active" conditions according to the observed MJO phase in that region. Our method is more appropriate for the longer hindcasts in this study: it allows us to use all lead times from each GCM, regardless of the presence of an active MJO. Some GCMs damp the amplitude of the MJO sharply (Figure 2e), which would complicate analysis of tendencies by MJO phase. Choosing a small region might bias our analysis toward GCMs that happened to drift to their intrinsic climatologies less rapidly in that region than other GCMs; those same GCMs might drift more rapidly in other regions. Drift was not an issue for the 2 day hindcasts. To ensure that our analysis captured large-scale and not grid point variability, we first interpolated all rainfall and tendency data to a 10 • × 10 • horizontal grid; similar results were obtained for 5 • × 5 • and 20 • × 20 • grids.
Using all start dates and lead times, rain rate quartile thresholds were computed independently for each 10 • × 10 • box in the domain and for each 3 h period of the day (e.g., 00Z-03Z and 03Z-06Z), to avoid aliasing the diurnal cycle of convection onto the quartile thresholds. Quartiles were computed only from rain rate values >1 mm −1 . Table 3 lists the domain-and time-averaged rain rates in each quartile. In each 10 • ×10 • box, for each 3 h output time in all hindcasts, the linear trend of rain rates was computed with a 27 h sliding centered Figure 7. The composite total diabatic heating (K d −1 ) from GCM physics, excluding radiation, binned by (colors) quartiles of rain rates >1 mm d −1 and (black) all rain rates ≤1 mm d −1 . Composites are produced from three-hourly precipitation and heating rates on a 10 • × 10 • horizontal grid, using data from all hindcast start dates and (b-h) all lead times; statistics are accumulated over 10 • S-10 • N and 60 • E-180 • . Rain rate quartiles are computed separately at each grid point and for each 3 h phase of the diurnal cycle. Circles on the right-hand vertical axis show the mean rain rate in each quartile. (a) Computed from ECMWF-YOTC 0-23 h forecasts for the period of the 20 day hindcasts. window (i.e., using nine values of the three-hourly rain rate, centered on the current time). This trend was then correlated with the values in the window, to determine if the trend was statistically significant at the 5% level. This allowed us to determine, for example, whether a grid box with a rain rate in a transition (second or third) quartile was moving from heavier to lighter rain rates or from lighter to heavier rain rates. We initially created four composite profiles for each rain rate quartile, across (1) all cases in that quartile, (2) linearly increasing rain rate cases, (3) linearly decreasing rain rate cases, and (4) cases where the linear trend was not significant. These four types of composites produced highly similar results for all GCMs, indicating that the composites are insensitive to the 27 h linear trend of rainfall. We show only the composites for all cases (case (1) above); the project website contains versions of Figures 7-10 for all four types of composites. Composites were also computed for all rain rates ≤1 mm d −1 .
As for the precipitation-humidity and precipitation-vertical velocity relationships, we found only very small variations in the heating and moistening tendency composites with hindcast lead time. To maximize the sample size, we compute composites over all lead times and start dates.
We apply the same method to 3 h rainfall and temperature and specific humidity tendencies from the ECMWF-YOTC 0-23 h forecasts over the hindcast period, to compare the GCMs against the closest available approximation to observations of these relationships. We use ECMWF-YOTC rainfall, rather than 10.1002/2014JD022374 TRMM, because of the need for the rainfall to be collocated and contemporaneous with the heating and moistening tendencies.

Diabatic Heating
Before analyzing the diabatic heating, we note that the composite total rate of change of temperature (i.e., dT/dt) in all GCMs is very small, typically less than 0.1 K d −1 , regardless of rain rate (not shown). This is consistent with the weak temperature gradient approximation for the tropical atmosphere [e.g., Sobel et al., 2001;Bretherton and Sobel, 2003]. The composite tendencies in Figures 7 and 8 are balanced by a nearly equal and opposite composite advective tendency (not shown).

Nonradiative Diabatic Heating
We discuss first the composite total diabatic heating from the model subgrid physics, except for radiation ( Figure 7); radiative tendencies are presented in section 3.4.2. Heating profiles from individual schemes for all GCMs are available on the project website. Tendencies from individual schemes are often not directly comparable across GCMs, because of intermodel variations in which physical processes are represented in which parameterizations.
ECMWF-YOTC shows mean boundary layer heating at all rain rates, associated with turbulent fluxes (Figure 7a). Above the top of the boundary layer (925 hPa), heating rates tend to decrease with height in the first (driest) quartile and for rain rates less than 1 mm d −1 . In the second quartile, free-tropospheric heating rates are roughly constant to 400 hPa; the profiles become progressively more top-heavy as rain rates increase through the third and fourth (wettest) quartiles. There is a local minimum in heating around the freezing level 10.1002/2014JD022374 Figure 9. As in Figure 7 but for the total diabatic moistening (Q 2 ; g kg −1 d −1 ) from the GCM subgrid physics schemes.
(500-600 hPa) from the melting of frozen precipitation, which is also present to a greater or lesser extent in several other GCMs, including the IFS (Figure 7c Figure 7f ). The lower height of the "kink" in the MIROC5 heating profiles may be due to a warmer threshold temperature at which hydrometeors melt or to differences in the temperature profile. To quantify the "top-heaviness" of the heating profiles, Table 3 gives the fraction of heating above 550 hPa for all models and all quartiles. For all precipitation bands, ECMWF-YOTC produces the most top-heavy heating profiles of any model. CAM5-ZM, a higher-fidelity model, produces similarly shaped heating profiles to ECMWF-YOTC for the first and second quartiles, which have similar mean rain rates (Figure 7b). In the third and fourth quartiles CAM5-ZM shows more bottom-heavy heating than ECMWF-YOTC (Table 3); in the fourth quartile the CAM5-ZM profile is upright until the freezing level, whereas the ECMWF-YOTC profile shows sharp increases associated with stronger convergence. Heating rates in the third and fourth quartiles are lower overall in CAM5-ZM relative to ECMWF-YOTC, associated with smaller mean rain rates. SPCAM3 also produces similarly shaped profiles to CAM5-ZM (not shown), despite the lower MJO fidelity for these cases in SPCAM3.
It is not surprising that the heating profiles from the IFS, another high-fidelity model, are similar to ECMWF-YOTC, given that an earlier IFS version produced the ECMWF-YOTC forecasts. However, the IFS hindcasts show much less top-heavy heating profiles than ECMWF-YOTC with a stronger minimum at the freezing level in the fourth quartile. As for the rainfall-moisture relationship, this demonstrates that the IFS hindcasts produce different behavior to ECMWF-YOTC, likely due to both the difference in model version and drift away from the ECMWF-YOTC analysis.  Figure 7 but for the total rate of change of specific humidity (dq/dt; g kg −1 d −1 ). Note that the x axis has one fifth the range of the horizontal axis in Figure 9.
MRI-AGCM3 and CNRM-AM, two moderate-fidelity models, differ considerably in the structure and evolution of their heating profiles with rain rate. MRI-AGCM3 produces the most top-heavy heating profiles of any hindcast GCM (Table 3), while CNRM-AM shows upright heating profiles until the fourth quartile. There is also substantial inter-GCM variability among the lower fidelity models. MIROC5 produces profiles that are similarly top-heavy to CAM5-ZM and more top-heavy than IFS, suggesting that top-heavy profiles are not a distinguishing feature of higher-skill models. NavGEM1 (Figure 7g) has the most bottom-heavy profiles of any GCMs in this study, with "upright" profiles for even the most intense convection. CanCM4 (Figure 7h) shows profiles similar to MRI-AGCM3 and MetUM-GA3 (not shown) for heavy and moderate precipitation, but without a reduction near the freezing level. For the first quartile, however, the CanCM4 profile is upright, with weak but constant heating between 850 hPa and 300 hPa and 30% of the heating above the freezing level (Table 3), suggesting it produces deep convection even at 2 mm d −1 .
Despite the variations in the profiles noted above, we find no consistent features among higher-fidelity or lower fidelity GCMs. Most GCMs produce heating profiles that are similarly top-heavy for each rainfall quartile, at least when measured by the fraction of total heating above the freezing level (Table 3). NavGEM1 produces bottom-heavy profiles and poor MJO predictions, indicating that top-heavy heating may be a necessary but not sufficient condition; it is not possible to draw firm conclusions from only one GCM, however. Correlations between model skill and the fractions of heating above 400 hPa, 500 hPa, and 550 hPa, as well as the height of peak heating, for each rainfall quartile produced no statistically significant relationships.

Radiative Heating
When composited by rain rate, the shapes, magnitudes, and occasionally even the signs of the GCM radiative heating profiles disagree, both with one another and with ECMWF-YOTC (Figure 8). This is consistent with Xavier et al. [2015], who analyze discrepancies in radiative heating for many of the GCMs in this study, using 2 day hindcasts and timestep data. We discuss the radiative heating profiles to confirm that the inter-GCM variations exist on broader spatial and temporal scales (10 • ×10 • , 3 h data) and across a wider range of hindcast start dates.
ECMWF-YOTC, IFS, MRI-AGCM3, and CNRM-AM (Figures 8a and 8c-8e, respectively) generate radiative cooling profiles that, for a given precipitation quartile, are much more constant with height above the boundary layer than the other GCMs. In these GCMs, vertical variations in shortwave heating and longwave cooling roughly balance. The other models, CAM5-ZM, MIROC5, NavGEM1, and CanCM4 (Figures 8b and 8f-8h, respectively), decrease the magnitude of radiative cooling with height in the free troposphere, although there are substantial differences in the shape of the cooling profiles, and hence of the cloud profiles. These models show increasing shortwave heating and reduced longwave cooling with height, particularly at low rain rates, indicating lower cloud tops or optically thinner upper level cloud than ECMWF-YOTC, IFS, MRI-AGCM3, and CNRM-AM. ECMWF-YOTC, IFS, MRI-AGCM3, MIROC5, and, to a lesser extent, CAM5-ZM exhibit "kinks" in the radiative profiles at similar altitudes to corresponding features in the nonradiative profiles (Figure 7), suggesting the production of cloud liquid by detrainment near the freezing level. All GCMs disagree on the shape and magnitude of the upper tropospheric profiles for intense convection, presumably due to differences in cloud top heights and cloud water content.
The profiles composited on rain rates ≤1 mm d −1 (black lines in Figure 8) are effectively clear-sky profiles. ECMWF-YOTC, CAM5-ZM, IFS, and MIROC5 show relatively smaller deviations from these profiles with increasing rain rate; in SPCAM (not shown), the deviations become apparent only for the heaviest (fourth) rain rate quartile. MRI-AGCM3, CNRM-AM, and MetUM (not shown) produce somewhat stronger deviations, even in the first quartile, suggesting that they produce larger amounts of cloud during suppressed conditions. In all models, the profiles for the first, second, and third quartiles have a consistent shape, at least below 300 hPa, and change only in amplitude as rain rate increases. This implies preferred heights for cloud in these models, which do not change between relatively suppressed and active convective conditions.
As for the nonradiative profiles (section 3.4.1), the radiative profiles reveal substantial inter-GCM variations but do not distinguish between higher-and lower fidelity models. We also examined the composite total diabatic heating profiles (i.e., the sum of Figures 7 and 8), which also did not discriminate between higher-and lower fidelity GCMs.

Moistening Tendencies
We composite the vertical profiles of moisture (specific humidity) tendencies by rain rate as for diabatic heating (section 3.3), using all start dates and lead times. While the total rate of change of temperature (dT/dt) was uniformly nearly zero in GCMs, due to a balance between dynamics and physics, the total rate of change of moisture (dq/dt) is frequently nonnegligible. We analyze the implications of the net moistening and drying that results from the lack of local compensation between the dynamics and physics. We were unable to compute moisture tendencies for CAM5 due to an incomplete data set, but the results of the closely related CAM5-ZM are presented below. As for the temperature tendencies, composite moisture tendencies from all subgrid schemes for all GCMs are available on the project website.

Diabatic Moistening Tendencies
First, we analyze the diabatic moistening from the subgrid physics ("Q 2 , " which we define to be positive for moistening; Figure 9), which in GCMs is a balance between drying by condensation and moistening from surface fluxes, convective detrainment, and evaporation of cloud water and falling hydrometeors. For rain rates less than 1 mm d −1 and in the driest quartile, ECMWF-YOTC displays boundary layer and lower and midtropospheric moistening (Figure 9a). In the third and fourth quartiles, the GCM physics dries all but the lowest levels of the column. Moisture tendencies are mostly negligible in the second quartile, but there is some moistening in the boundary layer and near the freezing level and slight drying in the lower and upper troposphere. Higher-fidelity models tend to produce a similar evolution of the Q 2 moistening profiles to ECMWF-YOTC, as shown by CAM5-ZM and IFS (Figures 9b and 9c), however, so do two of the lower fidelity models: MIROC5 and SPCAM3 (Figures 9f and 9g).
Among the moderate-fidelity models, MRI-AGCM3 displays strong boundary layer drying and lower tropospheric moistening at all but the highest rain rates (Figure 9d). This is due to strong vertical transports of moisture by convection, even in suppressed conditions (not shown). Section 3.5.2 demonstrates that these tendencies are mostly compensated by advective drying. CNRM-AM has positive Q 2 above the boundary layer for only the first quartile and very light (≤1 mm d −1 ) rainfall, transitioning to a negative Q 2 at the second quartile (Figure 9e). CNRM-AM and two of the lower fidelity models, CanCM4 (Figure 9h) and NavGEM1 (not shown), produce free-tropospheric drying from physics above the boundary layer in the second quartile, as do IFS and ECMWF-YOTC, suggesting that this behavior is not a distinguishing feature of either higher-or lower skill models.
There is little consensus among either the higher-skill or the lower skill GCMs on the shape of the Q 2 profile or the rain rate at which free-tropospheric Q 2 shifts from net moistening to net drying. For instance, SPCAM3 and MIROC5 produce profiles for low rain rates that resemble IFS. We note that SPCAM3 produces the deepest moistening in the second quartile of any GCM except MRI-AGCM3, which behaves quite differently to the other models.

Total Moistening
Analyzing the total rate of change of moisture (dq/dt) by rain rate quartile produces the clearest distinction found in this study between higher-skill and lower skill GCMs. The dq/dt profiles ( Figure 10) are the sum of Q 2 (Figure 9) and the moisture tendencies from the GCM dynamics. All models produce net boundary layer moistening for rain rates ≤1 mm d −1 . In ECMWF-YOTC ( Figure 10a) and all hindcast GCMs except MIROC5 (Figure 10f ), this moistening extends into the lower troposphere (above 900 hPa). All models also show drying aloft, consistent with large-scale subsidence.
All higher-fidelity models produce net moistening in the lower free troposphere (i.e., above the boundary layer) in the first and second quartiles, as shown for CAM5-ZM and IFS (Figures 10b and 10c). The higher-fidelity models also produce midtropospheric net moistening in the second and third quartiles, similar to ECMWF-YOTC, although CAM5-ZM produces only very slight midlevel moistening in the second quartile. For IFS, a comparison with Figure 9c reveals that much of the 800-600 hPa moistening in the second quartile comes from the IFS dynamics, since the physics tendency is negligible or negative. Between the third and fourth quartiles, the higher-fidelity models transition to lower level net drying and upper level net moistening, also similar to ECMWF-YOTC.
The moderate-fidelity MRI-AGCM3 produces relatively small moistening tendencies (Figure 10d), which confirm that the large Q 2 values (Figure 9d) are typically compensated by opposite and nearly equal tendencies from the dynamics. MRI-AGCM3 produces moistening through much of the column in the first, second, and even third quartiles, suggesting that this behavior is not exclusive to the higher-fidelity models. In CNRM-AM, the composite profiles suggest that the dynamics consistently produces stronger moisture tendencies than the physics, with column drying in the first and second quartiles and column moistening in the third and fourth quartiles (Figure 10e). The dq/dt profiles have opposite signs to the Q 2 profiles (Figure 9e), indicating the dominance of dynamics over physics. This behavior bears little resemblance to the other GCMs in this project; we discuss these results further later in this section.
Among the lower fidelity models, SPCAM3 produces lower tropospheric moistening in the first quartile, but in the second quartile the tendency becomes negligible, with weak lower tropospheric (upper tropospheric) moistening (drying; Figure 10g). The moistening in the second quartile is weaker than IFS and somewhat weaker than MRI-AGCM. Since SPCAM3 produced positive lower tropospheric Q 2 in the second quartile (Figure 9g), the negligible dq/dt suggests a stronger compensation between advective drying and physics moistening in SPCAM3 than in IFS. As in IFS and MRI-AGCM, SPCAM3 produces low-level drying (from physics) and upper level moistening (from dynamics) in the third and fourth quartiles. MIROC5 ( Figure 10f ) and CanCM4 (Figure 10h) show very little free-tropospheric net moistening for rain rates above the first quartile. MIROC5 produces lower tropospheric net drying in the second quartile, while in CanCM4 dq/dt is nearly zero. Therefore, the higher-fidelity GCMs and some moderate-fidelity GCMs produce lower and midtropospheric net moistening for light rain rates, while the lower fidelity GCMs produce net drying or negligible tendencies.
To extend these results, we compute composites of the vertical profiles of dq/dt using the narrower rain rate bins from Figure 5 to more clearly show the transition from low-level (upper level) moistening (drying) to upper level moistening (drying) with increasing rain rate (Figure 11). We also show the compensation between Figure 11. Shading shows composites of the total rate of change of specific humidity (dq/dt; g kg −1 d −1 ) binned by the rain rates on the horizontal axis. The solid (dotted) lines are the zero contours of the specific humidity tendencies from the GCM dynamics (physics). Dynamic tendencies are positive (moistening) above and to the right of the solid line; physics tendencies are positive below and to the left of the dotted line. The dashed line shows a PDF of rain rates, using the right-hand vertical axis.

Figure 12.
MJO fidelity in the 20 day hindcasts plotted against the pattern correlation of composite net moistening binned by rain rate (Figure 11) against the same from ECMWF-YOTC net moistening and TRMM precipitation (Figure 11a). The least squares regression line and correlation coefficient are also shown. We do not assess MJO fidelity in NICAM, so we place its symbol on the horizontal axis and exclude it from the correlation and regression analysis. the moisture tendencies from the GCM dynamics and physics, by including the zero contours from dynamics and physics.
Among the hindcast GCMs, CAM5-ZM, IFS, MRI-AGCM3, and SPCAM3 (Figures 11b-11d  and 11g) clearly show a smooth progression from low-level moistening and upper level drying at low rain rates (≤2 mm d −1 ), through to midlevel moistening at moderate rain rates (2-9 mm d −1 ) and upper level moistening and low-level drying at heavy rain rates (≥9 mm d −1 ). This behavior agrees strongly with ECMWF-YOTC (Figure 11a). The moisture tendencies are somewhat weaker in MRI-AGCM3, as seen in Figure 10d), and the moistening is displaced vertically relative to GISS-E2, IFS, and SPCAM3, boundary layer moisture into the free troposphere (Figure 9d). At moderate rain rates, all four models and ECMWF-YOTC display moistening from both physics and dynamics between 850 hPa and 400 hPa, which strengthens the moistening at these levels. Several other higher-fidelity and moderate-fidelity GCMs also demonstrated this feature, including GEOS5, GISS-E2, and ECEarth3 (not shown).
Several GCMs with moderate or lower fidelity did not display either this progressive rising in the moistening level with rain rate or the enhancement of midlevel moistening by positive tendencies from both dynamics and physics. As discussed above, in CNRM-AM the dynamics moisture tendency is opposite to and frequently greater than the physics tendency; Figure 11e confirms this, with a sharp transition from net drying to net moistening near 7 mm d −1 , collocated with the zero-moistening contour of the dynamics tendency. The sharp transition may be caused by the lack of vertical tilt with rain rate in the zero-moistening contour from the dynamics; all other models show this tilt to a greater or lesser extent, as does ECMWF-YOTC. The sharp transition suggests an instability in CNRM-AM, where heavy (light) precipitation is associated with column moistening (drying). Curiously, the peak in the rain rate probability density function (PDF) (dashed line) is very close to the critical rain rate between column moistening and drying, implying that the model is finely balanced between these two unstable states. Jiang et al. [2015] found that 20 year CNRM simulations in atmosphere-only and coupled configurations showed large negative values of gross moist stability, implying that the large-scale circulation response to convection strongly moistens the column. MIROC5 and CanCM4 show fragmented patterns of moistening with rain rate, without much indication of a transition as rain rate increases (Figures 11f and 11h, respectively). NavGEM1, another lower skill model, displays similar behavior (not shown).
Although the association is not perfect, particularly for SPCAM3, we find that the higher-fidelity GCMs show clear and smooth progressions in the height of net moistening with increasing rain rate, including midlevel moistening at moderate rain rates of 2-9 mm d −1 . In most cases, this midlevel moistening comes from a combination of moistening from physics and dynamics. With the exception of SPCAM3, lower fidelity models show a nearly complete compensation between moisture tendencies from dynamics and physics, particularly at moderate rain rates, resulting in weak, if any, midlevel moistening. This results in lower fidelity models producing a fragmented progression from low-level (upper level) moistening (drying) at low rain rates to low-level (upper level) drying (moistening) at heavy rain rates.
To quantify the relationship between this diagnostic and MJO hindcast fidelity, we compute pattern correlations of the composite net moistening profiles shown in Figure 11 with ECMWF-YOTC ( Figure 11a) and plot the resulting correlations against RMM bivariate skill. Figure 12 demonstrates that there is a statistically significant correlation (r = 0.82, p = 0.001) between a model's fidelity in the net moistening diagnostic, relative to ECMWF-YOTC, and that model's MJO hindcast fidelity. However, we note that the moderate-fidelity models, which produced either 15 days or 16 days of skill, show pattern correlations ranging from 0.13 (CNRM-AM) to 0.71 (ECEarth3), which confirms that the net moistening diagnostic is not a perfect indicator of MJO fidelity.

Discussion
Since the 20 day hindcast component of the model evaluation project seeks to link the climate (20 year simulations)  and NWP (2 day hindcasts)  components (section 2.1), it is useful to relate our results to those from the other studies. Eleven GCMs from this study-all except IFS and NICAM-also produced 20 year simulations. Among those models, there is little correspondance between the hindcast skill evaluated here and the MJO fidelity assessed in Jiang et al. [2015]. Several models that performed well in these initialized hindcast simulations, such as CAM5, CAM5-ZM, and GEOS5, displayed weak or moderate fidelity in the 20 year climate simulations. Conversely, SPCAM3 is a lower fidelity model in the 20 day hindcasts but shows reasonable amplitude and propagation in the 20 year climate simulations.
While we found no significant correlation between MJO hindcast fidelity and the precipitation-moisture and precipitation-vertical velocity relationships (Figure 6), Jiang et al. [2015] identified a correlation between MJO amplitude and the difference in mass-weighted 850-500 hPa relative humidity between the bottom 10% and top 5% of daily rainfall events, as suggested by Kim et al. [2014b] and Maloney et al. [2014]. In a further manuscript that synthesizes and summarizes the conclusions of the entire project, we apply this relative humidity diagnostic to the 20 day hindcasts from the nine GCMs that submitted results to all the three components of the project [Klingaman et al., 2015]. That manuscript also discusses further the relationship between MJO fidelity in the 20 day hindcasts and 20 year climate simulations.
The analysis in Xavier et al. [2015] focuses on GCM behavior at very short lead times (12-36 h). Several of the key conclusions of that study, such as the strong intermodel discrepancies in radiative heating profiles, rely upon this behavior remaining consistent with lead time. Figure 8 confirms that even at 20 day leads, the GCMs differ considerably in the shape, magnitude, and even sign of the radiative heating profiles, particularly above the freezing level. Further, we have found that the heating and moistening profiles, as well as the precipitation-humidity relationships ( Figure 5), are robust features of the GCM physics schemes that are largely insensitive to lead time or to the presence of a strong MJO. This suggests that the time step tendencies analyzed by Xavier et al. [2015] are likely representative of the overall GCM behavior, although we have analyzed only 3 h averages, not time step data.
All three components have emphasized that shallow and midlevel moistening during the MJO-suppressed and transition phases may be critical to GCM representations of the MJO, which agrees with previous studies on the key role of shallow and midlevel convective clouds [e.g., Inness et al., 2001;Benedict and Randall, 2009;Cai et al., 2013]. Jiang et al. [2015] found that GCMs with strong, propagating MJOs showed heating profiles that tilted westward with height, with shallow heating and positive low-level moisture anomalies east of the active convection, similar to recent modeling studies with single GCMs [e.g., Lappen and Schumacher, 2014]. Xavier et al. [2015] highlighted that most GCMs produced too much rainfall in the MJO-suppressed phase and displayed strong midtropospheric dry biases in the transition phase, even at 12-36 h lead times, suggesting a lack of moistening by shallow or congestus convection. In this study, the highest-skill GCMs showed midlevel moistening at moderate rain rates, indicative of the transition phase, with positive contributions from both the GCM dynamics and subgrid physics. Other GCMs exhibited a much sharper transition from low-level (upper level) moistening (drying) at light-moderate rain rates to upper level (low-level) moistening (drying) for moderate-heavy precipitation. These results with full GCMs support the results of an SCM intercomparison [Woolnough et al., 2010], in which SCMs that produced too much precipitation and too little convective moistening in the suppressed phase had a later and sharper transition to active conditions, which disagreed with the CRM results. GCMs in this study that continued to moisten the lower and midtroposphere with increasing rain rate, by both convective and dynamical processes, produced superior hindcasts to those that showed negligible or negative moisture tendencies at similar rain rates.
This study did not find any significant relationship between the shape of the GCM diabatic heating profiles and their representations of the MJO, based on these hindcasts. Despite several recent modeling efforts that concluded a top-heavy heating profile for deep convection, produced by stratiform heating, was necessary to simulate the MJO [e.g., Fu and Wang, 2009;Seo and Wang, 2010], we found no correlation between the top-heaviness of the profiles and MJO fidelity. All GCMs except NavGEM1 produced similarly top-heavy profiles; the IFS produced one of the least top-heavy profiles but relatively high MJO fidelity (Table 3). The strongest conclusion possible is that a top-heavy profile is a necessary but not a sufficient GCM feature for representing the MJO. The potentially greater importance of shallow convection and low-level moistening ahead of the MJO over a top-heavy heating profile agrees with the results of detailed experiments of the sensitivity to heating structure in CAM Schumacher, 2012, 2014].
Even the association between midtropospheric moistening during the suppressed and transition phases and MJO prediction is far from perfect, however. Despite the amount of detailed model output collected, it has not been possible to identify a single, unifying feature of either the highest-skill or lowest-skill GCMs. There are two likely explanations. First, there may be several or even many factors that are individually necessary for capturing the MJO in a GCM, but which are sufficient for producing an MJO only when a GCM includes all of them. This is equivalent to hypothesizing that a GCM can fail to represent the MJO for any one of a wide range of reasons. For example, SPCAM3 produces a top-heavy heating profile (Table 3), low-level moistening from convection in the suppressed phase (Figure 9g), and a gradual increase in the height of net moistening with increasing rain rate ( Figure 11g) yet has relatively low MJO fidelity for these cases (Figure 2). The lack of skill is likely due to an overly high RMM amplitude (Figure 2e) from too strong vertical shear caused by a lack of momentum transport by convection. In NavGEM1, the poor MJO may be caused by its bottom-heavy heating profile ( Figure 7g); in CanCM4 it may be caused by its lack of midlevel moistening and sharp transition from suppressed to active conditions (Figure 11h). These models may simulate other diabatic heating and moistening processes well, which complicates finding a single process that only the highest-skill models capture. We recommend that these deficiencies should form the basis for further experiments by the participating modeling centers, using the experiment design developed in this project, in which one aspect of the GCM physics (e.g., the triggering of convection) is varied while all other aspects are held constant. These experiments are more likely to produce robust conclusions about the relationship between individual aspects of a GCM physics schemes and MJO fidelity than a large model intercomparison project such as this.
An alternate explanation for the lack of a single, discriminating diagnostic is that this study has not robustly assessed MJO prediction skill in these GCMs. We have examined only two MJO cases using 94 start dates. Certain GCMs that performed well or poorly for these cases may not maintain that performance over a much wider hindcast set. It is curious, for instance, that CAM5 and CAM5-ZM produce such skillful hindcasts when Jiang et al. [2015] found that they had weak MJO activity in a 20 year simulation. It may be that these GCMs can produce useful predictions when initialized with a strong MJO [e.g., Miura et al., 2007;, as most start dates have RMM amplitude ≥1 (Figure 1), but it may also be that Cases E and F are not a representative sample of the models' performance. A larger hindcast set was impractical because of the need to collect detailed, high-frequency output of GCM tendencies. We have tried to link the simulated diabatic processes in these cases to how accurately the GCMs predicted them, but we acknowledge that both the processes and the prediction skill may vary among MJO events.
Our conclusions about CanCM4, which we determined to be a lower skill model, may be influenced by the different initialization procedure used for that model, due to its coupled configuration (section 2.3). However, qualitative analysis of Hövmoller diagrams of CanCM4 OLR and zonal wind showed little evidence of an eastward propagating MJO beyond about 5 days lead time (not shown; available on the project website). We note that Jiang et al. [2015] found that CanCM4 had very weak MJO activity in its companion 20 year simulations.

Summary and Conclusions
This study has analyzed the 20 day hindcast component of the "Vertical structure and diabatic processes of the MJO" global model evaluation project. The aims of this component are (a) to link MJO hindcast fidelity to simulated diabatic heating and moistening processes and (b) to bridge the other two components of the project, which comprise 20 year climate simulations  and 2 day hindcasts , by examining the transition in GCMs between the YOTC analyses used as initial conditions and the models' own mean states (section 2.1). Thirteen GCMs provided 94 hindcasts, initialized once per day, of two strong MJO events during YOTC (Cases E and F; Figure 1). Additional GCM contributions were received using the high-resolution, explicit convection NICAM model-eight hindcasts per case-and MetUM-15 day coupled and atmosphere-only hindcasts initialized from MetUM analyses. Two LIM hindcast sets were provided and used as baseline measures of MJO skill, as an improvement over simple persistence.
From the bivariate correlation of simulated and observed Wheeler and Hendon [2004] RMM indices (Figures 2), we ranked the GCMs in the order of hindcast skill (Table 2), which is reflected in the arrangement of the figure panels in this manuscript. We stress that this ranking is not an assessment of the overall performance of these models for predicting the MJO but reflects only their ability to hindcast these two cases. Although all but one GCM had skill above persistence, several lower skill models were broadly similar to the two LIMs. We conducted the remainder of our analysis in the context of this ranking, to associate MJO fidelity with GCM representations of particular processes.
We found no relationship between MJO skill and the sensitivity of precipitation to environmental moisture ( Figure 5), which other studies had suggested was key to representing the MJO in GCMs [e.g., Thayer-Calder and Randall, 2009;Xavier, 2012;Klingaman and Woolnough, 2014]. We also found no link between MJO skill and the shape of the rain rate PDF or the sensitivity of precipitation to vertical velocity (section 3.2).
Likewise, most of the GCMs, regardless of skill, produced similar diabatic heating profiles from their physics schemes, when analyzed as a function of rain rate (section 3.4.1 and Figure 7). All GCMs except NavGEM1 displayed similarly top-heavy heating profiles for the heaviest quartile of precipitation; NavGEM1 had bottom-heavy profiles and low skill (Table 3). There were no significant correlations between the level of MJO skill and the fraction of heating above the freezing level (550 hPa) or above 400 hPa, despite recent studies that suggested that elevated, stratiform heating was essential to simulate the MJO [e.g., Fu and Wang, 2009;Seo and Wang, 2010]. It is possible that top-heavy heating is a necessary but not sufficient condition for producing an MJO; with only one GCM with bottom-heavy profiles, this experiment cannot confirm or reject that hypothesis. We note that Jiang et al. [2015] found no correlation between the stratiform rainfall fraction in GCMs and MJO fidelity. There were large discrepancies in the radiative heating profiles in these models, particularly above the freezing level, demonstrating that similar conclusions reached by Xavier et al. [2015] are not limited to short lead times or the presence of an active MJO in the initial conditions.
The diagnostic most able to discriminate between the highest-and lowest-skill GCMs was the net moistening (i.e., dq/dt) as a function of rain rate (Figures 11 and 12). While all models produced lower tropospheric moistening for low rain rates (generally ≤2 mm d −1 ) from their boundary layer and convection schemes, the highest-skill GCMs continued to moisten the lower and middle tropospheres as rain rates increased into the second and third quartiles (2-8 mm d −1 , Figure 9). In the midtroposphere, the highest-skill GCMs have moistening from both their dynamics and subgrid physics, which enhances the moistening at these levels relative to the lowest-skill models, which typically moisten only from their dynamics at these rain rates ( Figure 11) and often show negative or near-zero net dq/dt due to strong drying from condensation ( Figure 10). This finding supports the results of a previous SCM intercomparison of the MJO-suppressed phase during TOGA COARE [Woolnough et al., 2010]. It suggests that moistening from shallow or congestus convection during the MJO-suppressed and transition phases is critical to the maintenance and propagation of the active phase. This moistening is often missing in GCMs with poor MJOs, leading to midtropospheric dry biases [e.g., Xavier, 2012], which Xavier et al. [2015] found developed within the first 12-36 h in many of the GCMs used here. Our results also agree with the conclusion of Jiang et al. [2015] that GCMs with strong MJOs in 20 year climate simulations have heating profiles that tilt west with height, with shallow heating leading (east of ) the active MJO.
Even the association between prediction skill and moistening in the MJO-suppressed and transition phases was far from perfect, however, as some models (e.g., SPCAM3) performed reasonably well by this measure but produced relatively lower MJO fidelity. From this set of hindcasts, which were limited in scope by the need to collect high-frequency output of GCM tendencies, we found no process or diagnostic which all of the higher-fidelity models captured well and all of the lower fidelity models captured poorly. While this may be due to the limited sample of hindcasts, it likely emphasizes that GCMs can fail to predict to the MJO for any one of the many reasons, which no single diagnostic can capture.