Spatial evaluation of volcanic ash forecasts using satellite observations

. The decision to close airspace in the event of a volcanic eruption is based on hazard maps of predicted ash extent. These are produced using output from volcanic ash transport and dispersion (VATD) models. In this paper the fractions skill score has been used for the ﬁrst time to evaluate the spatial accuracy of VATD simulations relative to satellite retrievals of volcanic ash. This objective measure of skill provides more information than traditional point-by-point metrics, such as success index and Pearson correlation coefﬁcient, as it takes into the account spatial scale over which skill is being assessed. The FSS determines the scale over which a simulation has skill and can differentiate be-tween a “near miss” and a forecast that is badly misplaced. The idealized scenarios presented show that even simulations with considerable displacement errors have useful skill when evaluated over neighbourhood scales of 200–700 (km) 2 . This method could be used to compare forecasts produced by different VATDs or using different model parameters, assess the impact of assimilating satellite-retrieved ash data and evaluate VATD forecasts over a long time period.


Introduction
Volcanic ash provides a significant hazard to aircraft by reducing visibility and causing both temporary engine failure and permanent engine damage.The presence or threat of ash disrupts air traffic and can result in large financial losses to the aviation industry (Casadevall, 1994;Miller and Casadevall, 2000).The 2010 eruption of Eyjafjallajökull disrupted European airspace for 13 days, causing the cancellation of over 95 000 flights and an estimated global financial loss of USD 5 billion (Oxford-Economics, 2010).
In the event of an eruption, the decision to close airspace is based on information provided by one of the nine Volcanic Ash Advisory Centres (VAACs).The VAACs issue hazard maps of predicted ash cloud extent based on forecasts from volcanic ash transport and dispersion models (VATDs).After the large-scale disruption caused by the 2010 Eyjafjallajökull eruption in Iceland, new guidelines were brought in by the UK Civil Aviation Authority requiring predictions of ash concentration values.A small number of studies have been performed to evaluate forecasts of ash concentration, however they almost exclusively use ground-based measurements at point locations or data from short research flights (Dacre et al., 2011;Devenish et al., 2012;Folch et al., 2012;Grant et al., 2012;Kristiansen et al., 2012;Webster et al., 2012;Dacre et al., 2013) and although these data have high temporal resolution it is only possible to evaluate the model at a limited number of locations.
Satellite observations of volcanic ash clouds are vital for tracking the transport of the erupted ash.The high temporal and spatial resolution of data from geostationary satellites lends itself to data assimilation and model verification.Satellite imagery is an invaluable tool for forecasters and is used qualitatively by VAACs to give an indication of the accuracy of the location of the ash cloud predicted by VATDs.However, these comparisons are carried out manually and do not provide an objective measure of the skill of the VATD forecasts.Therefore it is not easily possible to compare the skill of forecasts made at different times or by different models, or to assess the impact of changing the value of a model input or parameterization.The large spatial coverage of the satellite observations provides an opportunity to quantitatively evaluate forecasts over a much larger area than was previously possible using ground-based or in situ measurements.
The evaluation of a 2-D forecast field presents many challenges.Straightforward summary statistics, such as root mean square error, and binary skill score measures based on hits, misses, false alarms and correct rejections which are used to evaluate forecast performance at a particular point are not always easy to interpret and can lead to an underestimation of forecast skill.For example, if a volcanic plume is forecast to have the perfect shape but is displaced due to small errors in wind speed, metrics that compare each point in space and time (known as point-by-point in this paper) would yield low values as the feature is not in the correct place at the correct time.This problem has given rise to a host of other techniques to evaluate model skill, each suitable for evaluating different aspects of the forecast (see Gilleland et al., 2010, for a review of these techniques).In this paper the spatial accuracy of the VATD forecasts is being assessed and therefore a neighbourhood technique is used.
The perceived accuracy of any forecast depends on the scale over which it is being assessed (if a spatial tolerance is acceptable).For example, it is easier to predict the presence of ash in a large area than a small one.Previous studies using point locations and point-by-point metrics to evaluate forecasts of volcanic ash fail to recognize forecasts that contain useful information unless it is in exactly the right place and at the right time.Many forecasts do have valuable information about the ash cloud in spite of small positional errors.For example, Webster et al. (2012) found an increase in agreement between simulated and observed ash concentrations if a "buffer zone" accounting for positional errors in the simulated ash cloud was used.Similarly Dacre et al. (2011) showed that if a temporal error of 9 h (equating to approximately 100 km displacement in space) was taken into account then the simulated ash column loadings match well with lidar observations.The aim of this paper is to develop an evaluation metric that can determine the spatial accuracy of volcanic ash forecasts.This metric utilizes a neighbourhood-based measure of skill called the fractions skill score (FSS) (Roberts and Lean, 2008).This skill score was developed for the verification of precipitation forecasts produced by numerical weather prediction (NWP) models.This technique has been chosen as it relaxes the requirements for exact matching between forecasts and observations; the fractional coverage of simulated ash within an area needs to match the fractional coverage of the satellite-retrieved ash to be counted as correct.It also provides users with information on the scale at which an acceptable level of skill is attained.To illustrate the use of this new technique VATD simulations made using the Numerical Atmospheric-dispersion Modelling Environment (NAME) (Jones et al., 2007) of the ash cloud from the 2010 Eyjafjallajökull eruption are evaluated against SEVIRI satellite observations made on 7, 9 and 14 May 2010.

NAME Simulations
NAME is the operational VATD used by the London VAAC.It is a Lagrangian particle dispersion model originally developed in response to the 1986 Chernoybl disaster.Particles, each representing a mass of volcanic ash, are released from a source (Jones et al., 2007).The particles are passively advected by 3-D wind fields provided by, in this case, the UK Met Office global NWP model analysis updated every 6 h and forecast fields updated every 3 h.The effect of turbulence is represented by stochastic perturbations to the particle trajectories based on semi-empirical turbulence profiles.NAME also includes parameterizations of sedimentation, dry deposition and wet deposition (Witham et al., 2012).The ash concentrations are calculated by summing the mass of particles in the model grid boxes and over 1 h.In this study the model grid boxes are 0.375 • latitude by 0.5625 • longitude (approximately 40 km × 40 km).
To predict the transport and dispersion of ash, information about the volcanic eruption is required.These are known as eruption source parameters (ESPs) and include plume rise height, mass eruption rate, vertical profile of the plume, particle density, and particle size distribution.In the simulations presented in this paper the plume height is based on observations by the Icelandic Meteorological Office's C-band radar (Arason et al., 2011) located at Keflavík International Airport.Note that the height of the plume varies over the time of the simulation presented here.It is assumed that the ash was distributed uniformly throughout the height of the plume.The mass eruption rate is given by an empirical relationship based on the plume height given by Mastin et al. (2009).The ash density is assumed to be 2500 kg m −3 and the particles are assumed to have a diameter of 1-3 µm.The choice of model parameters used here are similar to those used in Grant et al. (2012) but the technique presented here could be applied to any VATD simulation.The simulations presented in this study have a start time of 06:00 UTC on 1 May 2010.

SEVIRI Satellite observations
The Spinning Enhanced Visible and Infrared Imager (SE-VIRI) is mounted on the geosynchronous Meteosat Second Generation (MSG) satellite.It has 12 spectral channels and provides high temporal (15 min) and spatial (3 km resolution at the equator) observations.The high temporal and spatial resolution makes these observations ideally suited to evaluating the transport of volcanic ash following an eruption.
The volcanic ash measurements used in this paper are retrieved using the algorithm of Francis et al. (2012) which utilizes three long-wave window channels centred at 8.7, 10.8, and 12.0 µm to discriminate between meteorological cloud and ash cloud.Where ash is detected this algorithm determines ash layer top pressure, ash column loading and ash ef-fective radius.In this paper ash column loading is used to determine the horizontal accuracy of the simulated ash clouds.It is important to note that the detection of volcanic ash by satellite is dependent on the optical depth of the cloud and the physical properties of the ash.Optically thin ash clouds and ash particles smaller than 0.2 µm may not be detected.Following this, the minimum detection limit of ash is considered to be in the range of 0.2-1.0g m −2 (Francis et al., 2012;Prata and Prata, 2012).Other factors, namely the thermal contrast between the ash and the underlying surface, satellite viewing angle, ash cloud height and the presence of other absorbers (e.g.water, ice and sulphur dioxide), also affect the detection and retrieval of ash properties (Millington et al., 2012).A case study comparison for 17 May 2010 between retrieved column loadings and airborne lidar data is presented in Francis et al. (2012).The mass column loading values are in reasonable agreement with maximum values of 0.7-0.8g m −2 in both data sets.The column loading values derived in Francis et al. (2012) are also qualitatively comparable to those presented in Thomas and Prata (2011).By applying their retrieval algorithm Dubuisson et al. (2014) found comparable values to Francis et al. (2012) for mean effective radius, plume height and mass loading for 6 May 2010.
For comparison with NAME the satellite-retrieved column integrated loadings are averaged on to a regular 0.375 • × 0.5625 • grid and averaged over a period of 5 h centred on the verification time.This time averaging is used to smooth the SEVIRI ash observations which can be very patchy.The choice of a 5 h averaging time was based on the results of a set of simple data denial experiments.The results of these experiments can be found in Appendix A.

The evaluation method
There are many neighbourhood skill scores described in the literature (see Ebert, 2008, andGilleland et al., 2010, for an overview).The method used in this paper is based on the FSS developed by Roberts and Lean (2008) to test the skill of high-resolution precipitation forecasts (e.g.Roberts, 2008, andMittermaier andRoberts, 2010) and is routinely computed for that purpose in the operational verification suite at the UK Met Office (Mittermaier et al., 2013).It compares fractional coverage in the forecast field with fractional coverage in the observational field for a specified precipitation threshold and over a range of neighbourhood sizes to determine the spatial scale over which a simulation can be considered skilful.
The evaluation is performed in two stages.First the simulation and satellite fractions (where fractions are the fractional coverage of a specified neighbourhood size in which pixels exceed a pre-defined threshold) are generated, then these fractions are compared using FSS.Here we initially focus on a case study hour at 00:00 UTC on 14 May 2010 during the Eyjafjallajökull eruption (day 31 of the eruption).Fig- ure 1a shows the detected ash column loadings by SEVIRI at 00:00 UTC on the 14 May.The ash cloud was detected in a coherent plume extending south-eastwards from Iceland to the northwest of the UK.There is also a small patch of ash detected north of Iceland. Figure 1b shows the corresponding NAME simulated ash column loading at the same time.Note that this is day 14 of the simulation.A visual comparison of the satellite and NAME ash clouds suggests that at this time there is good agreement in the location of the maximum ash column loadings.

Stage 1: generating the fractional coverage
In general, NAME simulates a more extensive ash cloud structure than the satellite observations.This is largely due to the minimum detection limit of the satellite observations.Therefore, to perform a meaningful quantitative evaluation between the simulated and satellite-retrieved ash cloud, a threshold must be applied to the NAME column loadings.In the case of precipitation forecasts a 95th percentile threshold is commonly used.This threshold selects the highest 5 % of radar and simulated precipitation accumulations in the do-  main independently.This is done to remove any bias in precipitation amounts when the focus is to look at the spatial accuracy of the forecast only.In the case of volcanic ash a fixed percentile threshold is not appropriate due to the artificial cut off in the distribution of retrieved ash column loadings due to the detection limit of the satellite.This cut off can be seen in Fig. 3a.Ash column loadings less than 0.2 g m −2 are not retrieved during the period 7-16 May 2010.
The satellite-retrieved values of ash column loading often have large errors associated with them (Francis, personal communication).We considered the values as a binary ash/no ash detection flag.The detection limit means that there are far more grid boxes populated with ash in the simulations than in the satellite observations.Therefore to ensure a fair comparison with the satellite the number of simulated ash grid boxes used in the comparison is restricted to match the number of grid boxes with observed ash (i.e. the area of ash cloud being compared in both the NAME simulation and satellite observations is the same at each evaluation time).For example, if there are 250 grid boxes with satellite-retrieved ash then the 250 NAME grid boxes with the highest ash column loading are used in the comparison.This removes bias from the forecast and is equivalent to using a time varying percentile threshold (Fig. 3b).This process will be referred to as pixel matching in this paper.The fraction of the domain covered by satellite-retrieved ash varies between 3.4 and 14.6 % giving a percentile threshold of 85.4-96.6 %.An example of how this pixel matching modifies the NAME ash distribution is shown in Fig. 1c.In this case the number of satellite pixels containing ash is 422, giving a percentile threshold of 94.6 % and a NAME concentration threshold of 0.6 g m −2 at this time (comparable to the stated minimum detection limit of Francis et al., 2012, andPrata andPrata, 2012) when assuming a distal fine ash fraction (DFAF) of 3 %.DFAF is the percentage of the ash vented from the volcano that undergoes long-range transport (Dacre et al., 2011;Grant et al., 2012;Devenish et al., 2012).Note that the ash column loading threshold can vary from 0.2 to 1.2 g m −2 at this time when using other plausible DFAFs of 1 and 6 % respectively (Fig. 3b).Three further examples of pixel matching at 21:00 UTC 7 May 2010, 00:00 UTC 9 May 2010 and 12:00 UTC 14 May 2010 are shown in Fig. 2.
The fraction of grid points containing ash for different sized square neighbourhoods centred on each gridbox is then calculated for both the pixel matched NAME data and satellite observations.In this paper neighbourhood sizes of 40-1200 (km) 2 are considered.Note that in this paper 40 (km) 2 represents a neighbourhood size of 40 km × 40 km, approximately equal to the grid scale.

Stage 2: computing the FSS
The FSS is calculated in the following way: (Roberts and Lean, 2008) where the Fractions Brier Score (FBS) is a variation on the Brier Score (Brier, 1950) in which both the simulated and observed probabilities (or fractions) can have any value between 0 and 1. FBS is given by: (2) M j and O j are the modelled and observed fractions respectively at each point, with values between 0 and 1. N is the number of pixels in the verification area.FBS ref is given by FBS ref is the largest FBS that could be obtained from the simulated and observed fraction which occurs when there is no collocation of non-zero fractions.A FSS of 1 indicates a perfect match between the modelled and observed fractions whilst a FSS of 0 indicates a complete mismatch.In general, a forecast with FSS > 0.5 is considered skilful (Roberts and Lean, 2008).The FSS, calculated using a 40 (km) 2 neighbourhood (the grid scale), at 00:00 UTC on 14 May 2010 is 0.51 indicating that the NAME simulation has skill in capturing the satelliteretrieved spatial distribution of volcanic ash at this scale.This objective measure agrees with the subjective visual comparison of Fig. 1a and c which show fairly good spatial agreement in the location of the ash cloud at the 40 (km) 2 scale.
5 What if the simulated ash cloud is displaced from the satellite-retrieved ash cloud?
One vital input parameter for a VATD is the height of the plume.At the time of eruption this can be uncertain and can evolve throughout the eruption period.The use of an incorrect plume height could result in ash being transported in a different direction and at a different speed than it experiences in reality due to changes in windspeed and direction with height.In this section a set of idealized scenarios are presented where the NAME simulated ash plume is artificially where s is a stretching factor and E lat and E lon are the latitude and longitude of Eyjafjallajökull.The NAME simulated ash cloud is interpolated on to this transformed grid.Note that the stretching transformation is applied to the NAME output before pixel matching to ensure that the number of grid cells with simulated and retrieved ash remain the same.
Figure 8 shows how the transformations applied to the simulated ash plume affect the FSS as a function of neighbourhood size for (a) 00:00 UTC 14 May 2010, (b) 12:00 UTC 14 May 2010, (c) 00:00 UTC 9 May 2010, and (d) 21:00 UTC 7 May 2010.In all cases, the largest values of FSS are given by the simulated ash with no stretch transformation.In each case, apart from 12:00 UTC 14 May, the NAME is skilful (FSS > 0.5) for a neighbourhood size of 40 (km) 2 (the grid scale).In all cases, FSS reduces as the stretch transformation becomes more extreme.This is in agreement with the authors' subjective visual inspection of Figs.4-7.For the most conservative stretch scenario (fac-   4-7, a FSS of 0.5 is reached at neighbourhood sizes of 120-200 (km) 2 in all cases apart from 12:00 UTC 14 May which reaches skilful level at 360 (km) 2 .When considering the stretch factor 0.5 case, panel b, the threshold for skill is not reached until neighbourhoods of 680 (km) 2 are used for all cases apart from 21:00 UTC 7 May.In this case, the skilful level is reached when using a neighbourhood size 280-360 (km) 2 .Having skill at a neighbourhood size 680 (km) 2 is comparable to using a grid box of 6 • × 6 • at these latitudes.A simulation that has skill at this scale could predict the presence of ash regionally in the UK (i.e.distinguish between London, Manchester, and Edinburgh airports).A simulation with skill only at larger scales would be not be useful.In the cases presented here the transformations using stretch factor 2 (panel d), perform the worst in all cases apart from 00:00 UTC 9 May.It does not reach the skilful level until neighbourhood sizes greater than 1000 (km) 2 are used.Note that in all cases presented here skill continues to increase with increasing neighbourhood size after the 0.5 skilful threshold has been reached.This analysis demonstrates that even though there may be a location error in the simulated distribution of ash, the simulations are still skilful using the FSS measure and therefore provide useful information at scales that are helpful even though traditional point-by-point measures may consider them unskilful.Table 1 shows the value of success index (SI), Pearson correlation coefficient (PCC) and FSS for neighbourhood sizes of 600 (km) 2 .SI, also known as the critical success index (Schaefer, 1990), is a simple metric based on a 2 × 2 contingency table of hits (a), false alarms (b), misses (c) and correct rejections (d).It is given by SI = a/(a + b + c), it assesses the match between the area of simulated ash cloud and area of satellite-retrieved ash cloud (Stunder et al., 2007).An SI of 1 indicates complete overlap between simulated and retrieved ash whereas an SI equal to 0 indicates no overlap.Stunder et al. (2007) suggests that a forecast with an SI value 0.25 is an acceptable forecast.SI is calculated in Webley et al. (2009) to compare the output from two different VATDs with different eruption source parameters for the 1992 Mount Spur eruption.The SI values found in this study range from 0.17 to 0.60.PCC is also known as the linear correlation coefficient.A simulation with a PCC value of 1 has complete correlation between the simulated and measured ash cloud.PCC is one of the measures calculated by Kristiansen et al. (2012) to evaluate and compare the skill of several different VATDs.Kristiansen et al. (2012) consider 0.36-0.48 to be significant correlations.For all the skill metrics the highest values are for the simulation with no stretch.The simulation with stretch factor 1.2 has the next highest values of skill.In the case of no stretch and stretch factor 1.2 the FSS values are greater than the 0.5 threshold for skill, the PCC values fall within the bounds Kristiansen et al. (2012) consider skilful and the SI values are within the range Webley et al. (2009) found in their analysis of the impact of the vertical distribution of ash and ash particle size distribution.For the 00:00 UTC 14 May case, the SI and PCC for both stretch factor 0.5 and stretch factor 2 are very low and, by chance, equal, however by subjective visual inspection the stretch factor 0.5 ash cloud appears to more closely match the satellite-retrieved ash than the stretch factor 2 ash cloud.This is supported by the FSS score for the stretch factor 0.5 ash cloud having a higher FSS than the stretch factor 2 cloud at smaller spatial scales.This highlights the fact that point-by-point measures are unable to distinguish between a simulation that is a near-miss or a simulation that is completely wrong, although they do still pick out the "best" simulation in this instance.Similar results are seen for the three other examples (see Table 1).

Summary and conclusions
In this paper it has been shown that a neighbourhood-based metric fractions skill score (FSS) is suitable for evaluating simulations of volcanic ash clouds using satellite observations.This measure of skill provides more information than traditional point-by-point metrics, such as success index and Pearson correlation coefficient, as it takes into account spatial scale over which skill is being assessed and can be used to determine the spatial scale over which the VATD model should be believed.In the case studies presented here the NAME simulation had skill (FSS > 0.5) at neighbourhood scale of ∼ 40 (km) 2 (the grid resolution).Even simulations with considerable displacement errors have skill when using larger neighbourhood sizes of 200-700 (km) 2 .The advantage of this kind of evaluation is that the objectively determined results for a set of idealized displacement scenarios are often much more similar to a subjective visual inspection of the simulations than other evaluation measures.Although the evaluation in this paper has focussed on a set of idealized scenarios the FSS method could, in principle, be used to evaluate forecasts over a longer period of time.It could also be used to compare forecasts with different ESPs or model parameters, or forecasts from an ensemble of simulations performed with different models, input meteorology and emissions, or assess the impact of assimilation of satellite data.This will be the focus of future studies.The assimilation could be for the ESPs (e.g.Stohl et al., 2011) or the distribution of ash downstream from the volcano (e.g.Wilkins et al., 2015).The methodology presented could also be extended to the distribution of sulphur dioxide following an eruption or to forecasts of other dispersion events, for example, after a nuclear incident or a forest fire.

Appendix A: SEVIRI retrieval smoothing time
This section describes the data denial experiments used to determine the SEVIRI smoothing time used in this study.In these experiments satellite-retrieved ash column loadings at a verification time (t 0 ) were considered the "truth" and compared using the root mean squared error (RMSE) to satelliteretrieved ash column loadings with 50 % of the pixels randomly removed and replaced with a time-averaged field using observations up to 8 h before and after t 0 .This was done for each hour in the period 8-14 May 2010.This experiment was performed 50 times using different random sampling to assess the spread in the RMSE due to different areas in the plume being replaced.
Figure A1 shows the results of the data denial experiments.The solid symbols show the median RMSE value and the boxes indicate the interquartile range.There are several interesting points to note.Firstly, there is a large spread between different days.This is due to the time varying mass eruption rate of the volcano and changing meteorological conditions.Secondly, the minimum in the RMSE does not always occur when the data from the closest times are used.This is most evident on 9, 10 and 14 May where there is a minimum at ±2 h.On these days there is also only a small variation in RMSE when the averaging window is increased from ±2 to ±8 h.It can also be seen that as the averaging window increases the distribution of RMSE values becomes more negatively skewed.RMSE penalizes variance as it gives errors with larger absolute magnitudes more weight than errors with small absolute values.It is thus sensitive to outliers, which are reduced by the time-averaging method.This is one disadvantage of using RMSE to compare satellite images, or in fact any pair of 2-D fields and provides further motivation for new verification measures.On 8, 11, 12, 13 May the behaviour is monotonic, as the RMSE increases as the averaging window increases, however there is little difference in RMSE between using ±1 or ±2 h.The interquartile ranges on these days show the distribution of RMSE is more Gaussian.Similar results are obtained if 20, 80 and 100 % of the data are replaced (not shown).

Figure 1 .
Figure 1.Ash column loading at 00:00 UTC on 14 May 2010 (a) by the satellite (with 5 h smoothing), (b) simulated by NAME, (c) NAME simulated ash cloud after pixel matching (i.e.black indicates pixels selected in satellite matching process).Panel (a) uses the colour scale shown in panel (b).
stretched and squashed to represent the possible impact of an incorrect plume height.The transformations used are shown in Figs.4-7 and are performed in the following way:

Figure 8 .
Figure 8.The FSS as a function of neighbourhood size for each of the three translations (blue line: stretch factor 0.5, green line: stretch factor 1.2 and yellow line: stretch factor 2) compared to the original NAME simulation (red line) shown in Figs.4-7.

Figure A1 .
Figure A1.The median RMSE between the SEVIRI observations at t 0 ("truth") and the truth with 50 % of the pixels randomly replaced by the time-averaged observations for each day 8-14 May 2010 (8 May: stars, 9 May: downward-pointing triangles, 10 May: pentagons, 11 May: hexagons, 12 May: upward-pointing triangles, 13 May: circles, 14 May: squares).Each random replacement is repeated 50 times and the error bars show the interquartile range of the RMSE from these iterations.

Table 1 .
The value of success index (SI), Pearson correlation coefficient (PCC), FSS for a neighbourhood of 600 (km) 2 and the scale at which the FSS reaches a value of 0.5 for the scenarios presented in Figs.4-7.