
Abstract. Simulations using IPCC (Intergovernmental Panel on Climate Change)-class climate models are subject to fail or crash for a variety of reasons. Quantitative analysis of the failures can yield useful insights to better understand and improve the models. During the course of uncertainty quantification (UQ) ensemble simulations to assess the effects of ocean model parameter uncertainties on climate simulations, we experienced a series of simulation crashes within the Parallel Ocean Program (POP2) component of the Community Climate System Model (CCSM4). About 8.5% of our CCSM4 simulations failed for numerical reasons at combinations of POP2 parameter values. We applied support vector machine (SVM) classification from machine learning to quantify and predict the probability of failure as a function of the values of 18 POP2 parameters. A committee of SVM classifiers readily predicted model failures in an independent validation ensemble, as assessed by the area under the receiver operating characteristic (ROC) curve metric (AUC > 0.96). The causes of the simulation failures were determined through a global sensitivity analysis. Combinations of 8 parameters related to ocean mixing and viscosity from three different POP2 parameterizations were the major sources of the failures. This information can be used to improve POP2 and CCSM4 by incorporating correlations across the relevant parameters. Our method can also be used to quantify, predict, and understand simulation crashes in other complex geoscientific models.


Introduction
Modern global three-dimensional climate models are extraordinarily complex pieces of science (e.g., Randall et al., 2007; Gent et al., 2011; The HadGEM2 Development Team, 2011) and software engineering (Easterbrook et al., 2011; Rugaber et al., 2011; Easterbrook, 2010). They contain over a million lines of code (Easterbrook and Johns, 2009; Easterbrook, 2012) and use hundreds to thousands of files, functions, and subroutines to solve equations of state and conservation laws for the flows of matter, energy, and momentum within and between the atmosphere, oceans, land, and other reservoirs of the Earth system (Washington and Parkinson, 2005). They also use numerous algorithms of biological, chemical, geologic, and anthropogenic processes to simulate the cycles of carbon, nitrogen, sulfur, aerosols, ozone, greenhouse gases, and other climate-relevant quantities of interest. To compound this complexity, these algorithms operate across many orders of magnitude in space and time, and contain constituents that exist in gas, liquid, solid and mixed phases.
Given this enormous range of scientific complexity, climate models are vulnerable to many types of software design and implementation issues. Climate models are developed in a manner analogous to large open source and agile software projects (Easterbrook and Johns, 2009). Based on current best understanding, small groups of scientists create, test, and refine modules for select climate processes or sub-systems (e.g., atmospheric convection or aerosol microphysics). Their software changes are committed upstream to the climate model codebase, and the cycle is repeated until the model simulations reproduce desired features (i.e., model validation). Varying amounts of software testing are conducted throughout the cycle, but formal code verification practices (e.g., see D'Silva et al., 2008) are only recently starting to be considered for climate model development (Clune and Rood, 2011; Farrell et al., 2011). Nonetheless, the concentration on sound science, as opposed to software correctness, has led to climate models that contain fewer software defects than other comparably sized projects (Pipitone and Easterbrook, 2012).
Software issues aside, many potential problems still arise with scientific representations in complex models. Just as code verification can be used to find software bugs, emerging tools being developed in the field of uncertainty quantification (UQ) (see National Research Council Report, 2012) can help pinpoint scientific discrepancies in simulation models, knowledge of which can be used to guide and improve model development. Primary UQ targets for climate models are schemes containing parameters with adjustable values. These schemes represent physical processes that are not fully understood or cannot be directly simulated at the model resolutions of interest (e.g., Stensrud, 2009). Parameterizations like this are often developed in isolation, so they can respond in unexpected ways when inserted in nonlinear climate models and coupled to other parameterizations. Small perturbations to the values of the adjustable parameters can amplify and lead to large changes in simulation outputs. In some cases, the simulations may fail altogether.
We report here on a series of simulation crashes encountered while running perturbed parameter UQ ensembles of the Community Climate System Model Version 4 (CCSM4) (Gent et al., 2011; CCSM4, 2012). We treat the simulation crash problem as a black box in which we know only the values of the input parameters and a binary outcome flag indicating whether each simulation ultimately failed or completed, information that does not require detailed scientific knowledge. Using this information, we present a method that successfully predicted crashes in independent simulations and pinpointed the model parameters that caused the failures.
Numerous studies have applied UQ techniques to climate models similar to CCSM4 (e.g., Murphy et al., 2004; Stainforth et al., 2005; Jackson et al., 2008; Sanderson, 2011; Shiogama et al., 2012). However, analogous simulation failures and crashes have been reported far less often, though we suspect that they occur more frequently than indicated by the relatively limited number of documented cases (i.e., a reporting bias). Failures, crashes, or bifurcations have been reported for climate models of both intermediate complexity (Webster et al., 2004; Annan et al., 2005; Edwards et al., 2011) and full complexity (see supplementary discussion in Stainforth et al., 2005). The failures in these cases were attributed to numerical instabilities, or to particular climate phenomena, such as the collapse of the Atlantic meridional overturning circulation or accelerated changes through positive feedbacks in the models. Webster et al. (2004) and Edwards et al. (2011) present methods for calculating the probability of failure for an input set of parameter values. In addition, Edwards et al. (2011) use the failure probability to design and prescreen ensemble members in follow-up ensembles.
Our analysis and approach are similar to Edwards et al. (2011), but with the added challenges of applying them to a climate model system that is computationally more demanding, uses smaller ensemble sizes, has more parameter uncertainty dimensions, and exhibits fewer simulation failures. To help overcome these challenges, we use machine learning algorithms to calculate and predict the failure probability. As climate models and other geoscientific codes become more complex and UQ studies more commonplace, we fully expect parameter-induced simulation crashes to occur in these models with greater frequency. Our failure analysis method will be beneficial for quantifying and determining the causes of these crashes.

Overview of climate simulations
Different sets of perturbed parameter UQ ensembles were executed as part of a broad effort to quantify and constrain uncertainties in the atmospheric, sea ice, and ocean model components of CCSM4 (Gent et al., 2011). The failures reported here occurred during simulations that perturbed parameter values in the Parallel Ocean Program (POP2), the ocean component of CCSM4. For these experiments, POP2 was coupled with the sea ice model, while data-based components were used for the land and atmosphere. The simulations were integrated for 10 yr, and the system was forced with climatological air-sea flux data using normal year forcing from Large and Yeager (2009). Further details about POP2 and the UQ ensembles are given below.

Ocean model and parameters
POP2 is a state-of-the-art depth-level model of the general ocean circulation that solves the 3-D primitive equations of rotational fluid dynamics and thermodynamics with the standard Boussinesq and hydrostatic approximations. It is developed and maintained at Los Alamos National Laboratory (Smith et al., 2010) and is the ocean component of CCSM4, developed at the National Center for Atmospheric Research (Gent et al., 2011; Danabasoglu et al., 2012). The current simulations use the displaced-pole coordinate grid with the pole centered over Greenland and have a nominal horizontal resolution of 1°. Vertically, the model resolves 60 depth levels, with resolution varying from 10 m in the upper ocean (surface to 200 m) to 250 m in the deeper ocean. Refer to Smith et al. (2010) and Danabasoglu et al. (2012) for more information.
The ocean model parameters perturbed in this study were selected by POP2 model developers. They are used in six different sub-grid scale parameterizations to simulate the effects of horizontal and vertical turbulent mixing in the oceans.
The parameters and their uncertainty ranges are summarized in Table 1. Parameters 1-6 are used to capture horizontal mixing of momentum with spatially anisotropic viscosity (Large et al., 2001; Smith and McWilliams, 2003). Parameters 7-9 are used for horizontal mixing of tracers via isopycnal eddy-induced transport (Gent and McWilliams, 1990). Parameters 10-12 are used in recently developed schemes to simulate submesoscale and mixed-layer eddies (Fox-Kemper et al., 2008) and abyssal tidal mixing (Jayne, 2009). Parameters 13-18 are used for vertical convection and vertical mixing with the K-profile parameterization (KPP) scheme (Large et al., 1994).

UQ ensembles
Table 2 summarizes the UQ ensemble simulations. Three separate ensemble studies were conducted, each consisting of 180 simulations. The table also indicates the contributions of the studies to different types of analysis. For instance, studies 1 and 2 were used to train machine learning algorithms to learn about simulation crashes (see Sect. 4.3), while study 3 was used to test the ability to predict simulation crashes (see Sect. 4.4). Out of 540 total simulations, there were 46 failures, with the failures occurring at various times during the integration period. Each of the three studies used a Latin hypercube method to sample the values of the 18 POP2 parameters and a different random seed to generate the ensemble. The model parameters were represented using standard uniform or log-uniform probability distribution functions normalized to [0, 1] using the ranges (low and high values) and scales (linear and logarithmic) noted in Table 1.
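As a concrete illustration of this normalization, the minimal sketch below maps a raw parameter value onto [0, 1] using either a linear or a logarithmic scale. The function name, example values, and ranges are illustrative assumptions and are not taken from Table 1 or from the UQ Pipeline.

```python
import numpy as np

def normalize_parameter(value, low, high, log_scale=False):
    """Map a raw parameter value to [0, 1] on a linear or logarithmic scale.

    Illustrative sketch of the normalization described in the text; the
    interface is hypothetical, not part of the UQ Pipeline.
    """
    if log_scale:
        return (np.log10(value) - np.log10(low)) / (np.log10(high) - np.log10(low))
    return (value - low) / (high - low)

# Hypothetical examples: a parameter spanning several decades uses the log
# scale (high/low >= 5), while a narrow-range parameter uses the linear scale.
print(normalize_parameter(1.0e3, 1.0e2, 1.0e5, log_scale=True))   # ~0.333
print(normalize_parameter(0.75, 0.5, 1.0, log_scale=False))       # 0.5
```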
The Latin hypercube method is a stratified, space-filling variant of Monte Carlo sampling that is used extensively in UQ and uncertainty analysis (McKay et al., 1979; Helton and Davis, 2003). This sampling approach has superior variance reduction properties over standard Monte Carlo sampling for some problems (Stein, 1987). For an ensemble size of N, Latin hypercube sampling splits each of the D parameter distributions into N intervals of equal probability, resulting in a multi-dimensional grid with N^D separate bins. For our case, D = 18 and N = 180. Ensemble members are obtained by selecting parameter values from different bins chosen at random, with the important constraint that the bins are chosen so that each interval along every parameter dimension is sampled only one time per ensemble. An example of a five-member Latin hypercube ensemble for two parameters is given by bins with indices (1, 4), (2, 2), (3, 5), (4, 3), and (5, 1), which is one of 120 valid possibilities for this configuration. For our three UQ ensembles, the actual Latin hypercube sample points are illustrated in Figs. 1 and 2 for four of the POP2 parameters. These figures show that the Latin hypercube sample point coverage is uniform and dense in one and two dimensions.
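A minimal sketch of this construction is given below: one random permutation of the N interval indices per dimension guarantees that each interval along every dimension is used exactly once, and each point is then jittered within its interval. This is for illustration only; the actual ensembles were generated with the LLNL UQ Pipeline sampler, and the function name and seed are assumptions.

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, seed=None):
    """Draw a Latin hypercube sample on the unit hypercube [0, 1]^n_dims.

    Each dimension is split into n_samples equal-probability intervals, and
    every interval along every dimension is used exactly once per ensemble.
    """
    rng = np.random.default_rng(seed)
    # One random permutation of the interval indices per dimension.
    intervals = np.array([rng.permutation(n_samples) for _ in range(n_dims)]).T
    # Jitter each point uniformly within its assigned interval.
    return (intervals + rng.random((n_samples, n_dims))) / n_samples

samples = latin_hypercube(180, 18, seed=0)   # one ensemble: N = 180, D = 18
print(samples.shape)                          # (180, 18)
```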
Ensembles were generated using the Lawrence Livermore National Laboratory UQ Pipeline (Walter, 2010; Tannahill et al., 2011), which is a Python-based scientific workflow system for running and analyzing concurrent UQ simulations on high performance computers. Using a simple, non-intrusive interface to simulation models, it provides strategies for sampling high dimensional uncertainty spaces, analyzing ensemble output, constructing surrogate approximation models (e.g., Forrester et al., 2008), incorporating observational data, performing statistical inferences, and estimating parameter values and probability distributions using maximum likelihood and Bayesian methods.
Of the many capabilities provided by the UQ Pipeline, the failure analysis presented here uses the simulation parameter values and a method for calculating parameter sensitivities.

Descriptive failure analysis
Figures 1 and 2 show simulation successes and failures for the three Latin hypercube studies (540 runs) as a function of the values of 4 of the 18 parameters sampled in POP2. Similar figures were generated for the other parameters, but are not displayed to keep the discussion brief and because the failures are most sensitive to changes in the parameters shown (see Sect. 5). It is not possible to directly visualize the dependencies in high dimensions, so the figures show the outcomes projected in one and two parameter dimensions (Figs. 1 and 2, respectively). From the figures it is clear that the failures are generally concentrated around high values of the parameters vconst_corr and vconst_2, and at low values of bckgrnd_vdc1. A weaker dependence of the failures on high values of convect_corr is also apparent.
The analysis presented in the following sections does not require a detailed understanding of the physical reasons that connect parameter values to simulation failures, though we briefly summarize the connections to help with the interpretation. The parameters vconst_corr and vconst_2 are part of the anisotropic horizontal viscosity parameterization applied to the momentum equations in POP2 (Smith et al., 2010). Their values are subject to three main constraints that reflect both the physical processes involved and the limitations needed to maintain numerical stability: their lower bounds are constrained by the grid Reynolds number, which represents the ratio between advection and diffusion; the Munk boundary layer constraint is needed to represent western boundary currents (Jochum et al., 2008); and their upper bounds are limited by a linear diffusion stability requirement specified by a viscous Courant-Friedrichs-Lewy (CFL) condition, which depends on the integration time step (one hour in this study) and grid resolution (Griffies, 2004; Large et al., 2001). High values of these parameters may violate the limit set by the CFL condition, which is the likely reason for the model failures seen in the experiments. The bckgrnd_vdc1 parameter is used to set the background diffusivity for diapycnal mixing from internal waves in the KPP vertical mixing parameterization (Large et al., 1994). Reducing the values of bckgrnd_vdc1 and the other bckgrnd parameters increases the numerical noise in the solution and consequently causes numerical instability. Similarly, increasing the value of convect_corr, which increases diffusivity and viscosity in the implicit KPP vertical mixing scheme, leads to instabilities in the vertical density profile. For detailed descriptions of all of the POP2 parameters used in the current study, please refer to Smith and McWilliams (2003), Large et al. (2001), and Danabasoglu et al. (2012).
In spite of the obvious relationships between the parameter values and simulation outcomes, other features present in the figures suggest that determining the causes of the failures is potentially complicated. Figure 2, for instance, indicates that there are strong correlations between failed simulations and pairs of parameter values. As one example, failures occur at the combination of high values of vconst_corr and low values of bckgrnd_vdc1. These two parameters reside in different modules in POP2 (hmix_aniso and vmix_kpp, respectively), which makes it difficult for POP2 model developers and users to discover and attribute simulation failures to correlations in these parameters.
A more important complication arises from the overlap of simulation successes and failures in the low dimensional projections shown in the figures. Some simulations appear to fail in the same general vicinity of parameter space where other simulations succeed, and vice versa. To illustrate, the upper right portion of the scatterplot between vconst_corr and vconst_2 in Fig. 2 contains a high density of failures and successes. Another notable example is the isolated failure event shown in the lower left-hand corner of the same scatterplot.
These overlaps can lead to serious misclassification errors in statistical models used to predict failures as a function of parameter values. Two types of misclassification errors can occur. Simulations that are predicted to fail, but actually succeed, are false positives or type I errors; those that are predicted to succeed, but actually fail, are false negatives or type II errors (see Sect. 4 for further details). Imbalanced data, in which the population of one class greatly outnumbers the populations of other classes, are associated with class overlap (Prati et al., 2004), and the POP2 outcomes are highly imbalanced (i.e., 46 failures out of 540 simulations). Another related explanation is that higher parameter dimensions, and possibly a non-linear decision boundary, are required to effectively separate the outcomes.
Statistical approaches more powerful than the descriptive relationships illustrated in Figs. 1 and 2 are therefore needed to attack our problem. As described in the remaining sections, we turned to algorithms and diagnostics developed in the fields of pattern recognition, machine learning, and signal detection. These methods provided us with the ability to predict simulation failures in advance of running the model and a tool to quantify the causes of the failures. This latter capability can be used to improve POP2 by making it more robust to parameter changes.

Probabilistic failure classification
For a given set of model input parameters, a POP2 simulation will either succeed or fail. We denote these outcomes by a two-class categorical variable in which failures belong to class C_f and successes belong to class C_s. The present discussion considers only a single failure class, but we recognize that simulations can fail for a variety of reasons (e.g., lack of iterative convergence, numerical instabilities, etc.). Without difficulty, the two-class methodology described below can be extended to handle multiple modes of failure through multiclass classification.
Our goal for probabilistic failure classification is to determine the probability that a POP2 simulation will fail for a vector of model input parameters x = (x_1, x_2, ..., x_18). We denote this using the conditional probability P(C_f|x). Using Bayes' rule, the posterior conditional probability can be written as

P(C_f|x) = P(x|C_f) P(C_f) / [P(x|C_f) P(C_f) + P(x|C_s) P(C_s)],    (1)

where P(x|C_i) and P(C_i) correspond to class-conditional densities and class priors, respectively. By introducing a variable λ representing the natural logarithm of the likelihood-odds ratio,

λ = ln[P(x|C_f) P(C_f)] − ln[P(x|C_s) P(C_s)],    (2)

Eq. (1) can be rewritten as the "S-shaped" logistic sigmoid function

P(C_f|x) = 1 / (1 + exp(−λ)).    (3)

The λ term is a function of x and takes values in (−∞, ∞).
As illustrated in Fig. 3, the sigmoid function is bounded between 0 and 1, inclusive. This formalism provides a mechanism to transform an input vector of model parameter values to a probability that the corresponding simulation will fail or succeed.
It is also useful to mention that the methods presented here can be applied to a much broader set of problems. Any geoscientific model output that varies continuously is amenable to probabilistic failure analysis by thresholding or discretizing the output, and then classifying the "failures" and "successes" as cases that fall on different sides of the threshold or fall within different bins. For instance, if a 5 K difference in global average surface temperature between a climate model simulation and a reference case is deemed excessive, then climate model instances above and below this threshold can be categorized as failures and successes, respectively. Equation (3) then gives the probability that a given case would "fail" by simulating temperatures that exceed the 5 K threshold, information that developers could use to improve their models.
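A minimal sketch of this kind of thresholding is shown below, assuming a hypothetical array of temperature differences; the variable names and values are illustrative only.

```python
import numpy as np

# Hypothetical global-mean surface temperature differences (K) between
# ensemble members and a reference simulation; 5 K is the example threshold
# discussed in the text.
delta_T = np.array([1.2, 6.3, 4.9, 7.1, 0.4])

# Discretize the continuous output into the two classes used by the classifier.
labels = np.where(delta_T > 5.0, "failure", "success")
print(labels)   # ['success' 'failure' 'success' 'failure' 'success']
```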

SVM classification
Support vector machine (SVM) classification (Vapnik, 1995; Cortes and Vapnik, 1995; Burges, 1998) from the fields of pattern recognition and supervised machine learning (Bishop, 2007; Kotsiantis, 2007) is used to assign a simulation to class C_f or C_s for input vector x. This type of classification problem can also be handled using other methods, such as logistic regression (Hosmer and Lemeshow, 2000), neural networks (Bishop, 2007), decision trees (Breiman et al., 1984), and random forests (Breiman, 2001). We limit our attention to SVMs, however, given our familiarity with the algorithm and its excellent performance on our climate model application.
Briefly, the SVM method is based on maximizing the distance between parallel hyperplanes that separate the classes (i.e., the margin), while allowing for misclassifications from overlapping data points during training (i.e., a soft margin). For non-linearly separable classes, the hyperplanes are determined by transforming the input space to a higher-dimensional feature space using kernel functions. The purpose of the transformation is to make it easier to separate the classes, as illustrated conceptually in Fig. 4. The support vectors are the training points that lie on the margin for classes that are separable, and lie on or within the margin for classes that are not. New input vectors x are assigned to a class using the sign of the predictive decision function

f(x) = Σ_{i=1}^{N_s} α_i y_i K(x_i, x) + b,    (4)

where f(x) > 0 and f(x) < 0 are assigned to classes C_f and C_s, respectively. The sum in Eq. (4) is over the N_s support vectors from the training set, y_i ∈ {−1, 1} is a binary outcome indicator variable, K(x_i, x) is the kernel function, and b and α_i are, respectively, bias and Lagrange multiplier terms determined through constrained optimization of the margin.
Refer to Burges (1998), Bishop (2007), or Chang and Lin (2011) for further details. The decision function in Eq. (4) assigns inputs to a class, but does not provide a probability of class membership. An extension to the standard SVM approach was therefore developed (Platt, 1999) that derives class probabilities by fitting λ in Eq. (2) to a two-parameter function of the decision value, using the training data and cross validation. A variation of this procedure is implemented in the LIBSVM package (Chang and Lin, 2011), which we used to fit SVM classifiers that calculate the failure probabilities P(C_f|x).
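The sketch below illustrates this style of probabilistic SVM classification using scikit-learn's SVC, which wraps LIBSVM and exposes the same C-support formulation, Gaussian (RBF) kernel, and Platt-style probability calibration. The synthetic training arrays and the particular C and γ values are placeholders, not the POP2 ensemble data or the settings used in this study.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for the training set: normalized parameter vectors and
# toy binary outcomes (1 = failure). Placeholder data only.
rng = np.random.default_rng(0)
X_train = rng.random((360, 18))
y_train = (X_train[:, 0] + X_train[:, 1] > 1.4).astype(int)

# C-support SVM with a Gaussian kernel; probability=True enables the
# Platt-style calibration provided by the LIBSVM backend.
clf = SVC(C=1.0, gamma=0.1, kernel="rbf", probability=True)
clf.fit(X_train, y_train)

x_new = rng.random((1, 18))
p_fail = clf.predict_proba(x_new)[0, list(clf.classes_).index(1)]
print(f"P(C_f | x) = {p_fail:.3f}")
```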
As described in more detail in Sect. 4.3, we also applied an ensemble learning approach (Dietterich, 2000) to create a "committee" of SVM classifiers. Each classifier in the committee contributes a vote (i.e., a failure probability), and the votes are tallied in different ways to predict simulation failures and to assess the performance of the classification system.

Classification performance
Using a format known as the "confusion matrix" in machine learning, Fig. 5 summarizes the four possible outcomes for the two-class simulation failure problem. Classifiers that correctly predict actual failures and successes are labeled true positives (TP) and true negatives (TN), respectively; those that incorrectly predict actual failures and successes are denoted false negatives (FN) and false positives (FP), respectively. Of the numerous ways to combine these outcomes to assess classifier performance, we focus on the true positive rate (TPR) and false positive rate (FPR), which are given by the expressions

TPR = TP / (TP + FN)    (5)

and

FPR = FP / (FP + TN).    (6)

Perfect classifiers have TPR and FPR values of 1 and 0, respectively. As noted previously, we use SVM classifiers that provide probabilities of class membership. The assignment to a particular class, and the resulting TPR and FPR values, therefore depend upon a specified decision variable and threshold value. If decisions are made using the failure probability with a threshold of 0.5, for example, then probabilities above and below this threshold will be assigned to classes C_f and C_s, respectively.
The quantities in Eqs. (5) and (6) are combined into a convenient diagram used in signal detection and decision analysis known as a receiver operating characteristic (ROC) curve (Swets, 1988; Fawcett, 2006). ROC curves plot the FPR (horizontal axis) versus the TPR (vertical axis) of a decision variable as the threshold is varied from +∞ to −∞. A perfect classifier is represented in ROC space by the vertical line connecting points (0, 0) and (0, 1), followed by the horizontal line connecting points (0, 1) and (1, 1). A classifier that makes random assignments, on the other hand, is represented by the diagonal line connecting points (0, 0) and (1, 1). The predictive capability of a classification system can therefore be assessed by a single number, the area under the ROC curve (AUC) (e.g., Marzban, 2004). As a rough rule of thumb, a classifier with an AUC score of about 0.8 or higher is useful for discrimination. The AUC score is used in following sections to train SVM classifiers and to test their performance on independent simulation failure data.
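The sketch below shows how an ROC curve and AUC score can be computed from a set of outcomes and a continuous decision variable, here using scikit-learn's metrics; the example labels and scores are synthetic placeholders rather than ensemble results.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic outcomes (1 = failure) and a continuous decision variable, e.g.,
# a committee-mean failure probability. Placeholder values only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.1, 0.3, 0.8, 0.2, 0.25, 0.9, 0.4, 0.05, 0.7, 0.35])

# Sweep the threshold over the scores to trace out the ROC curve, then
# summarize the curve with the area under it.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC =", roc_auc_score(y_true, scores))
```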

Supervised learning of simulation failures
The three UQ studies listed in Table 2 were performed in succession to one another. After completing studies 1 and 2, but before starting study 3, we trained a committee of SVM classifiers to learn about the simulation failures. The goal was to use the committee to predict the outcomes of study 3 before those simulations even began to run. This section describes the training procedure, while the following two sections describe the performance on study 3.
The training set consisted of the 360 simulations from studies 1 and 2, which had 32 simulation failures and 328 successes. Given the relatively small ratio of the number of failure events to the number of classifier inputs (i.e., 32/18), we utilized an ensemble learning approach (Dietterich, 2000) known as bootstrap aggregation (i.e., "bagging") (Breiman, 1996). Bagging quantifies the variability of classifier predictions and can help improve the overall classification performance. The bagging was applied by resampling the training data N_b = 100 times, allowing for duplicates in the samples (i.e., sampling with replacement). This step effectively created 100 different versions of the training data that were used to construct the committee of individual SVM classifiers. Failure predictions were then made by aggregating the votes across the committee, using an equal weight for each classifier. In particular, we computed the mean value and standard deviation of the failure probabilities across the committee,

µ_c(x) = (1/N_b) Σ_{j=1}^{N_b} P_j(C_f|x)    (7)

and

σ_c(x) = [(1/N_b) Σ_{j=1}^{N_b} (P_j(C_f|x) − µ_c(x))²]^(1/2),    (8)

where P_j(C_f|x) is the failure probability from the j-th classifier in the committee, and then combined these quantities in different ways to form decision variables for prediction and ROC analysis (see Sects. 4.4 and 4.5).
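A compact sketch of this bagging procedure is given below: the training data are resampled with replacement N_b times, one probabilistic SVM is fit per replicate, and the committee mean µ_c and standard deviation σ_c are computed for a new parameter vector. It is written with scikit-learn's LIBSVM wrapper; the function name and defaults are illustrative, and the C and γ values are those selected by the cross validation described in the next paragraph.

```python
import numpy as np
from sklearn.svm import SVC

def committee_vote(X_train, y_train, x_new, n_bags=100, C=3.0, gamma=0.1, seed=0):
    """Bootstrap-aggregated ("bagged") committee of probabilistic SVMs.

    Returns the committee mean (mu_c) and standard deviation (sigma_c) of the
    predicted failure probability for one new parameter vector. Sketch only;
    it assumes both classes appear in every bootstrap replicate, which is very
    likely at the failure rates discussed in the text.
    """
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)            # resample with replacement
        clf = SVC(C=C, gamma=gamma, kernel="rbf", probability=True)
        clf.fit(X_train[idx], y_train[idx])
        p = clf.predict_proba(x_new.reshape(1, -1))
        votes.append(p[0, list(clf.classes_).index(1)])
    votes = np.asarray(votes)
    return votes.mean(), votes.std()
```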
The LIBSVM package (Chang and Lin, 2011), which is freely available and open source, was used to train each of the classifiers in the committee. LIBSVM offers two versions of SVM classification (C-support and ν-support) and four standard types of kernel functions (linear, polynomial, Gaussian, and hyperbolic tangent). On the basis of familiarity and experience, we employed C-support classification with Gaussian kernels, K(x_i, x) = exp(−γ ||x_i − x||²), though we subsequently tested other kernels (see below). The values of two adjustable SVM-related parameters, the kernel width γ and the misclassification penalty C, were determined using a cross validation method (Arlot and Celisse, 2010). For each bootstrap replicate of the training dataset, we randomly selected 80 % of the data to construct an individual classifier and used the remaining 20 % to test that classifier. The objective was to find values for γ and C that are the same for all of the classifiers and that maximize their performance on these held-out tests. This task was accomplished by combining the individual tests into a large cross validation test set with 7200 data instances (0.2 test fraction × 360 simulations × 100 resampling size), and then computing the ROC curve and AUC score on this data. Figure 6 shows the ROC curve using the values that maximized the AUC (γ = 0.1, C = 3, and AUC = 0.93). The area under the curve is well above 0.8, which indicated that the SVM committee could be used for predicting simulation crashes in study 3.
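The sketch below scores a single (γ, C) pair with a simplified variant of this procedure: each repetition holds out a random 20 % of the data, an SVM is trained on the remainder, and the held-out predictions are pooled into one large test set whose AUC is the selection criterion. A small grid search over γ and C would then keep the best-scoring pair. The function name and the use of plain random splits (rather than the bootstrap replicates described above) are simplifying assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

def pooled_cv_auc(X, y, gamma, C, n_reps=100, test_frac=0.2, seed=0):
    """Score one (gamma, C) pair by pooling held-out predictions from many
    random 80/20 splits and computing the AUC of the pooled test set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    pooled_true, pooled_score = [], []
    for _ in range(n_reps):
        test = rng.choice(n, size=int(test_frac * n), replace=False)
        train = np.setdiff1d(np.arange(n), test)
        clf = SVC(C=C, gamma=gamma, kernel="rbf", probability=True)
        clf.fit(X[train], y[train])
        pooled_true.extend(y[test])
        pooled_score.extend(clf.predict_proba(X[test])[:, 1])  # P(class 1)
    return roc_auc_score(pooled_true, pooled_score)

# A coarse grid search would then keep the best-scoring pair, e.g.:
# best = max(((g, c) for g in [0.01, 0.1, 1.0] for c in [1, 3, 10]),
#            key=lambda gc: pooled_cv_auc(X, y, *gc))
```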
The analysis presented throughout the manuscript utilized only the Gaussian kernels. However, to test the sensitivity to the choice of SVM kernel, we subsequently re-trained the classifiers using the same training data and cross validation technique, but instead applied linear, cubic, and hyperbolic tangent kernel functions. The Gaussian kernels performed slightly better than the other kernels, but all of the kernels still achieved cross validation AUC scores above 0.92. This test suggests that the C_f and C_s classes are primarily linearly separable, because the linear kernel performed nearly as well as the nonlinear kernels.

Predicting simulation failures
Before running the 180 simulations in study 3, we used the SVM classifier committee trained from studies 1 and 2 to predict simulation failures in study 3. These predictions were sent out to group members by email at the beginning of the study (see fourth and fifth columns in Table 3) and were largely validated by the end of the study. The predictions were based on Eqs. (7) and (8). Simulations were assigned to the C_f class using decision criteria of the form D: v(x) ≥ t, where v is a decision variable computed from the committee and t is a threshold. Two initial criteria were selected using the same threshold (t = 0.5), but different decision variables. The first criterion, D_avg, used the committee average, µ_c ≥ 0.5, while the second, D_sum, used the sum of the committee average and standard deviation, µ_c + σ_c ≥ 0.5. The second criterion was chosen to account for variability across committee members by categorizing some simulations as C_f even though they had a committee mean below 0.5. After all of the simulations were completed, a third criterion based on the signal-to-noise ratio from the committee, D_snr, was also considered and analyzed (see Sect. 4.5).
The predictions and actual outcomes are summarized in Table 3 and displayed in Figs. 7 and 8. As noted in Table 2, there were 14 actual simulation failures and 166 successes in the study. The classifier committee performed exceedingly well using the two initial criteria, D_avg and D_sum. Referring to the confusion matrix in Fig. 7, both made 174 correct predictions (TP + TN, 96.7 % accuracy) and only six incorrect predictions (FP + FN). The incorrect predictions, however, are distributed differently for the two criteria. D_avg had more FNs than FPs, while D_sum had an equal number of each. Because of this difference, D_avg and D_sum operate at different points in ROC space. D_avg has ROC coordinates of (1/166, 9/14), while D_sum operates at (3/166, 11/14). Based on their Euclidean distance from a perfect classifier, which is given by [FPR² + (TPR − 1)²]^(1/2), we conclude that D_sum (distance = 0.215) performs better than D_avg (distance = 0.357).
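A short sketch of this comparison is given below; it recomputes the two operating points and their Euclidean distances from the perfect classifier at (FPR, TPR) = (0, 1) using the confusion-matrix counts implied by the coordinates quoted above. The function name is illustrative.

```python
import numpy as np

def roc_distance(fp, fn, tp, tn):
    """Euclidean distance of an operating point from the perfect classifier
    located at (FPR, TPR) = (0, 1)."""
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)
    return np.hypot(fpr, tpr - 1.0)

# Operating points for study 3 (166 successes, 14 failures) as quoted in the text.
print(round(roc_distance(fp=1, fn=5, tp=9, tn=165), 3))    # D_avg -> 0.357
print(round(roc_distance(fp=3, fn=3, tp=11, tn=163), 3))   # D_sum -> 0.215
```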
To ascertain the cause of the performance difference between D_sum and D_avg, the top and middle panels in Fig. 8 display the actual outcomes and predictions using the µ_c and µ_c + σ_c decision variables for the runs in study 3. The decision criteria are represented by the horizontal lines in the panels. Runs that are on or above the lines were predicted to fail, while those below were predicted to succeed. Correct predictions are displayed in blue (TP and TN), and incorrect predictions in red (FP and FN). The figure indicates, for example, that runs 17 and 120 failed, but were misclassified by D_avg because their µ_c values were slightly below 0.5. By comparison, D_sum assigned these runs to the correct class, but also misdiagnosed runs 2 and 95. A visual inspection of the figure shows that, except for the relative position of the threshold, the distributions of points in µ_c and µ_c + σ_c look very similar. We therefore attribute the performance difference to the threshold value. If D_avg had used a threshold value of about 0.4 instead of 0.5, it would have made the same predictions and had the same performance as D_sum.
At this point, we can also compare the predictive performance of our classification system to the "second design" predictions in Edwards et al. (2011) (see Sect. 4.3 therein). We do not compare to their "first design" results (i.e., the table of Sect. 4.2 therein), because those results use the same data to both train and evaluate their statistical model (i.e., they are not predictions). Moreover, Edwards et al. (2011) provide only TN and FN values for this design because they screened out simulations that were predicted to fail. Their model incorrectly classified 26 % of the predicted successes, which we calculate using FN/(FN + TN). By comparison, our system misclassified only 3 and 2 % of the predicted successes using D_avg and D_sum, respectively.

Retrospective analysis of simulation failures
After study 3 was completed, we applied the same SVM committee to test the performance of an additional criterion based on the signal-to-noise ratio, µ_c/σ_c, as the decision variable. Without prior knowledge about a setting for the threshold for this criterion, it was not used to predict simulation failures in advance. Retrospectively, we used the same values of µ_c and σ_c that were used for the predictions in Sect. 4.4, but determined and tested a setting for the threshold that maximizes the overall accuracy and minimizes the total number of false outcomes (FP + FN). The resulting criterion, D_snr, assigns a simulation to the C_f class when µ_c/σ_c equals or exceeds this optimized threshold. The performance of this criterion is displayed in Table 3 and Figs. 7 and 8. As shown, D_snr outperforms both D_avg and D_sum. If included in our set of predictions, this criterion would have made 176 correct predictions (97.8 % accuracy) and only four false predictions, balanced between two FNs (runs 27 and 44) and two FPs (runs 15 and 141). D_snr operates at ROC point (2/166, 12/14), which is a distance of 0.143 from a perfect classifier. The reason for the improved performance is shown more clearly in Fig. 8. The signal-to-noise ratio better separates the failures and successes than either of the other decision variables, although runs 44 and 141 are still grossly misclassified. In spite of the improvement, it is also worth noting that more simulations lie close to the D_snr threshold than to either D_avg or D_sum. This implies that the performance of D_snr is more sensitive to slight adjustments in the value of the threshold than the other criteria.
In retrospect, we also varied the thresholds for the three decision variables and calculated the FPRs and TPRs for study 3. The resulting ROC curves and the fixed locations of the decision criteria are shown in Fig. 9. The ROC curves for µ_c and µ_c + σ_c nearly overlap, which confirms the previous statement that these two decision variables perform similarly after accounting for threshold differences. Based on their AUC scores, µ_c performs marginally better than µ_c + σ_c because adding committee variability causes some successful simulations to get tallied relatively sooner as FPs (see points with values close to run 27 in Fig. 8). In contrast to these cases, the ROC curve for µ_c/σ_c is noticeably better and has an AUC of 0.966. This occurs because µ_c/σ_c is more effective at separating the classes, which enables it to identify more TPs as the threshold is lowered. Overall, however, all three decision variables perform exceedingly well at classifying new simulation failures. Using these ROC curves, we can choose decision criteria for making new predictions that consider the tradeoffs between TPs and FPs. Slightly lowering the threshold in D_snr, for example, will increase the TPR and move it to a point that lies closer to a perfect classifier in ROC space, but this occurs at the expense of also increasing the FPR.

Sensitivity analysis of simulation failures
Following on the demonstrated success of our predictions, we used the classifier committee to identify, quantify, and rank the importance of the model parameters responsible for the simulation failures. This information can be used to make the model more robust to parameter perturbations by improving the modules associated with the most sensitive parameters. For this analysis, we drew 10^4 Latin hypercube samples from uniform distributions representing the 18 POP2 parameters, calculated the average failure probability from a committee of SVM classifiers (µ_c) at each of the sample points, and then performed a global sensitivity analysis (Saltelli et al., 2000; Helton et al., 2006) on the parameter-induced variance of log µ_c. All of the available simulation data were used to compute the parameter sensitivities by re-training a new committee of 100 SVM classifiers with the full set of 540 simulations from studies 1-3. The training followed the procedure previously described in Sect. 4.3. Also note that the sensitivity analysis is illustrated below using µ_c as the committee response, but the same general results are obtained using the signal-to-noise ratio (µ_c/σ_c).

Polynomial chaos expansion of the failure probability
Parameter sensitivities were measured and ranked using Sobol indices (Sobol, 2001; Saltelli et al., 2000), which decompose the variance of log µ_c into contributions from individual parameters and various higher-order combinations of parameters. Polynomial chaos expansions (Wiener, 1938) provide a convenient format for the sensitivity analysis because the squares of the expansion coefficients are directly proportional to Sobol indices (Sudret, 2008; Lucas and Prinn, 2005; Tatang et al., 1997). The distribution of log µ_c was fit to the N_p = 18 parameters using a second-order polynomial chaos expansion expressed as

log µ_c ≈ a_0 + Σ_{i=1}^{N_p} b_i P_1(ξ_i) + Σ_{i=1}^{N_p} c_i P_2(ξ_i) + Σ_{i=1}^{N_p−1} Σ_{j=i+1}^{N_p} d_ij P_1(ξ_i) P_1(ξ_j),    (13)

where ξ_i is the random variable representation of parameter i, P_n(ξ_i) is an n-th order orthogonal polynomial in ξ_i, and a_0, b_i, c_i and d_ij are expansion coefficients to be determined.
For the case where the ξ_i are standard uniform random variables, the P_n(ξ_i) are the shifted Legendre polynomials (see Xiu and Karniadakis, 2002) with the following orthogonality property:

∫_0^1 P_m(ξ) P_n(ξ) dξ = δ_mn / (2n + 1),    (14)

where δ_mn is the Kronecker delta function. The first and second order shifted Legendre polynomials are given by

P_1(ξ) = 2ξ − 1    (15)

and

P_2(ξ) = 6ξ² − 6ξ + 1.    (16)

The coefficients in Eq. (13) were determined through least squares, and higher-order terms were not considered because the second-order expansion fits the data very well (adjusted R² = 0.98). The resulting fit is given in Table 4, which shows the leading terms of the expansion in two forms.
Analytical expressions for the moments of log µ_c as a function of the POP2 parameters were derived by directly taking expectation values of Eq. (13). The average value and variance are

avg(log µ_c) = a_0    (17)

and

var(log µ_c) = Σ_{i=1}^{N_p} (b_i²/3 + c_i²/5) + Σ_{i=1}^{N_p−1} Σ_{j=i+1}^{N_p} d_ij²/9.    (18)

The two groups of terms on the right hand side of Eq. (18) signify variance contributions from individual parameters (linear and quadratic) and pairs of parameters. The fractional values of the squared polynomial chaos expansion coefficients in Eq. (18) follow from application of Eq. (14).
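The fit and variance decomposition can be reproduced with ordinary least squares, as in the sketch below; it assumes an array xi of parameter values scaled to [0, 1] and an array log_mu holding log µ_c evaluated by the committee at those points, and the function and variable names are illustrative rather than taken from the UQ Pipeline.

```python
import numpy as np
from itertools import combinations

def P1(x):  # first-order shifted Legendre polynomial on [0, 1]
    return 2.0 * x - 1.0

def P2(x):  # second-order shifted Legendre polynomial on [0, 1]
    return 6.0 * x**2 - 6.0 * x + 1.0

def fit_pce(xi, log_mu):
    """Least-squares fit of the second-order polynomial chaos expansion in
    Eq. (13) and the variance decomposition of Eq. (18). Returns the mean
    term a_0, the per-parameter variance fractions, and the pairwise variance
    fractions keyed by parameter-index pairs. Sketch only."""
    n, d = xi.shape
    pairs = list(combinations(range(d), 2))
    cols = [np.ones(n)]
    cols += [P1(xi[:, i]) for i in range(d)]
    cols += [P2(xi[:, i]) for i in range(d)]
    cols += [P1(xi[:, i]) * P1(xi[:, j]) for i, j in pairs]
    A = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(A, log_mu, rcond=None)
    a0, b, c, dij = coef[0], coef[1:1 + d], coef[1 + d:1 + 2 * d], coef[1 + 2 * d:]
    # Orthogonality of the shifted Legendre polynomials (Eq. 14) gives the
    # variance contributions: b_i^2/3 + c_i^2/5 per parameter, d_ij^2/9 per pair.
    var_single = b**2 / 3.0 + c**2 / 5.0
    var_pair = dij**2 / 9.0
    total = var_single.sum() + var_pair.sum()
    return a0, var_single / total, dict(zip(pairs, var_pair / total))
```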

Sensitivity network of the failure probability
We visualize these parameter sensitivities and their connections using network graphs with nodes and edges. The size of node i in the graph is proportional to the fractional contribution from parameter i, (b_i²/3 + c_i²/5)/var(log µ_c), while the thickness of edge ij connecting node i and node j is proportional to the fractional contribution from joint variations of parameters i and j, (d_ij²/9)/var(log µ_c). The technique has been extended to include higher order effects (e.g., using edge ijk for 3rd-order terms), but this is not needed for the current application. Important parameters on the resulting network graph are represented by nodes that are large or make significant connections to other nodes.
The resulting sensitivity network is shown in Fig. 10. The top four parameters have the largest overall and most heavily connected nodes in the graph, and they collectively account for about 88 % of the variance of log µ_c. The strong connections indicate that the probability of simulation failure depends on correlations between these parameters.
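A sketch of how such a network graph could be drawn is given below, taking the individual and pairwise variance fractions (for instance, from the polynomial chaos sketch above) and rendering them with the networkx and matplotlib libraries. The scaling factors, layout, and function name are illustrative choices, not a description of how the published figure was produced.

```python
import networkx as nx
import matplotlib.pyplot as plt

def sensitivity_network(node_frac, edge_frac, labels):
    """Draw a sensitivity network: node areas scale with individual variance
    fractions and edge widths scale with pairwise fractions. Sketch only."""
    G = nx.Graph()
    for i, s in enumerate(node_frac):
        G.add_node(labels[i], frac=s)
    for (i, j), s in edge_frac.items():
        G.add_edge(labels[i], labels[j], frac=s)
    pos = nx.circular_layout(G)
    nx.draw_networkx_nodes(G, pos, node_size=[5000 * G.nodes[n]["frac"] for n in G])
    nx.draw_networkx_edges(G, pos, width=[50 * G[u][v]["frac"] for u, v in G.edges()])
    nx.draw_networkx_labels(G, pos, font_size=8)
    plt.axis("off")
    plt.show()
```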

Summary and conclusions
We experienced a series of code crashes while running Latin hypercube ensemble simulations that sampled the values of 18 ocean mixing and viscosity parameters in the POP2 component of CCSM4. The crashes occurred for numerical reasons at different combinations of parameter values, which we surmise is due to violations of numerical conditions defined in the model (e.g., CFL as described in Sect. 3). Assuming no special knowledge or physical insight about the specific nature of the crashes, we formulated the simulation crashes as a binary problem (i.e., they fail or succeed) and used machine learning classification to quantify failure probabilities as a function of the 18 model parameters. A highly predictive SVM classification system was trained from a dataset containing only 32 failure instances out of 360 simulations and validated using an independent set of 180 simulations. The resulting classification system had a prediction AUC score exceeding 0.96 and achieved discrimination accuracies above 97 %. Global sensitivity analysis was then used to identify eight model parameters from four different modules that drive high probabilities of failing, the results of which can be used to increase the robustness of CCSM4 to parameter perturbations. These methods can be used to characterize simulation failures in other complex scientific computer models. The climate ensemble failure dataset used for all of the analysis presented in this manuscript is being made available for public download at three sites (Bache and Lichman, 2013; MLdata.org, 2013; Lawrence Livermore National Laboratory Green Data Oasis, 2013).

Fig. 1. Simulation successes and failures for the three Latin hypercube studies (540 runs), plotted against individual values of four of the 18 POP2 parameters (one-dimensional projections).
Fig. 2. Simulation successes and failures for the three Latin hypercube studies (540 runs), plotted against pairs of values of four of the 18 POP2 parameters (two-dimensional projections).

Fig. 5. The confusion matrix showing the four possible outcomes for a two-class simulation failure problem.

Fig. 7. The confusion matrix for predictions of 180 simulations in study 3 using the SVM committee with three different decision criteria (D_avg, D_sum, and D_snr).

Fig. 8. Actual and predicted outcomes are shown for the 180 simulations in study 3. Predictions are based on three decision variables and thresholds (µ_c and D_avg, top; µ_c + σ_c and D_sum, middle; µ_c/σ_c and D_snr, bottom). The horizontal axis displays simulation numbers based on their order in the ensemble. Simulation failures and successes are shown using stars and circles, respectively. Correct and incorrect predictions are shown in blue and red, respectively.

Fig. 9. ROC curves for the 180 simulations in study 3 using the SVM committee variables (µ_c, µ_c + σ_c, and µ_c/σ_c). The locations of the discriminators using the fixed thresholds in D_avg, D_sum, and D_snr are also shown.

Fig. 10. Sensitivity of the probability of simulation failure to climate model parameters is shown using a network graph. Node size and connector thickness are proportional to sensitivity contributions from individual parameters and pairs of parameters, respectively. The eight parameters labeled in red are the main causes of simulation failures.

Table 1.
Parameters sampled in the CCSM4 parallel ocean model. a Individual corr parameters (numbers 1, 7, 9, and 13) are used to represent the correlated pair of parameters given in the description; for example, values drawn for vconst_corr are assigned to vconst_1 and vconst_6. b Linear and logarithmic scales are used for parameter ranges that have ratios of high/low < 5 and high/low ≥ 5, respectively.

Table 2.
Latin Hypercube Studies of the CCSM4 Parallel Ocean Program.


Table 3.
Predictions and outcomes of simulation crashes in study 3. Actual successes predicted by all decision criteria are not reported here for the sake of brevity. Predictions from D_avg and D_sum were made before the study, while those from D_snr were determined retrospectively.