Interactive comment on “Evaluation of numerical weather prediction model precipitation forecasts for use in short-term streamflow forecasting”

This manuscript presents an evaluation of four NWP models over a 5500 km² watershed in southeastern Australia, with an intended emphasis on streamflow forecasting. Forecasts are evaluated against individual station observations, without spatial interpolation, and against averaged precipitation over the catchment area. Continuous and categorical skill metrics are employed, and the influence of averaging window (3 h up to 24 h) as well as lead time (3 h up to 228 h) is assessed. The major findings

time and subsequently forcing the model with future weather conditions for the forecast period. A major source of uncertainty in this process is future precipitation. Numerical Weather Prediction (NWP) models have been used since 1946 to forecast precipitation and other atmospheric variables. However, forecasting precipitation is challenging because it is discontinuous and varies rapidly in space and time. The precipitation process depends not only on the synoptic situation but also on processes that are not explicitly considered by NWP models, including condensation, vertical convective transport of heat and moisture, and phase transitions of water between vapour, clouds, and ice (Damrath et al., 2000). Increased computing power and improvement of the NWP models have led to considerable advancement in the ability to predict precipitation. However, the skill of NWP models in forecasting precipitation is still relatively low, especially for very short lead times (e.g. < 12 h), for long lead times (e.g. > 5 days) and for fine-scale weather systems such as local-regional convective systems (e.g. thunderstorms). Some of the earliest experiments linking precipitation forecasts to hydrological applications began three decades ago (see Georgakakos and Hudlow, 1984). Accurate precipitation forecasts can reduce forcing uncertainty in hydrological (e.g. rainfall-runoff) models and can greatly improve the quality of streamflow forecasts. However, NWP precipitation forecasts are subject to three types of error (Habets et al., 2004): localisation, timing and intensity of precipitation events, which potentially limit their usefulness for streamflow forecasting. Hydrological models are sensitive to errors in the precipitation forecasts, which are propagated to the model outputs. Thus raw NWP model precipitation forecasts are rarely used directly to forecast streamflow.
The contribution of precipitation forecasts to the skill of streamflow forecasts is dependent on many factors, including lead time. At lead times that are less than the time of concentration of a catchment, precipitation forecasts will contribute little skill to streamflow forecasts. During this period, catchment and channel storage and the passage of an existing flood wave downstream are the main influences on the streamflow forecasts. NWP model precipitation forecasts are also typically unable to resolve the observed precipitation distribution at very short lead times, and persistence or extrapolation-based methods can provide better forecasts. Hence, NWP model precipitation forecasts are more useful for streamflow forecasting in extending forecast lead time, particularly in the range of a few days to one or two weeks (Cloke and Pappenberger, 2009; Cuo et al., 2011). However, the extent to which precipitation forecasts are beneficial for streamflow forecasts depends considerably on the ability of the NWP models to resolve the scale and processes relevant for hydrological applications and on whether the surface hydrology in the catchment is dominated by precipitation (Clark and Hay, 2004; Gebhardt et al., 2008).
Understanding the quality of NWP precipitation forecasts is an important step in assessing their potential contribution to the skill of streamflow forecasts. Objective evaluation or verification of precipitation forecasts did not begin until the mid-1990s (e.g. WMO Working Group on Numerical Experimentation, WWRP/WGNE, 2008). The overall purpose of evaluation is to ensure that forecasts are accurate, skilful and reliable from a technical point of view. Evaluation of precipitation forecasts is important to monitor forecast quality over space and time, to compare the quality of different forecast systems and to discover sources of model error to improve the forecast quality (WMO, 2000; WWRP/WGNE, 2008; Casati et al., 2008). However, from a streamflow forecasting perspective, the purpose of forecast evaluation is to understand the nature of forecast errors (e.g. bias, error on light versus heavy rain), which can inform the development of methods for post-processing raw forecasts to improve accuracy and reliability.
Evaluation of NWP model forecasts of precipitation is not a new topic. Numerous authors have verified precipitation forecasts from a meteorological perspective (e.g. Jolliffe and Stephenson, 2012). However, few authors have evaluated precipitation forecasts from a hydrological perspective (see e.g. Pappenberger et al., 2008). Georgakakos and Hudlow (1984) highlighted the relevance of precipitation forecast products to real-time hydrological forecasting. Golding (2000) identified the critical areas where NWP products fall short, and illustrated techniques being developed to address them. Damrath et al. (2000) verified 7 yr of precipitation forecasts from NWP models of the German Weather Services. Kaufmann et al. (2003) evaluated the quality of 8 yr of precipitation forecasts from the Swiss Model in Switzerland. Hay and Clark (2003) used 40 yr of 8-day ahead precipitation forecasts over the contiguous United States from the National Centres for Environmental Prediction reanalysis project to assess the possibilities for using the medium-range forecast model output. Richard et al. (2003) compared the ability of four European and Canadian mesoscale models to reproduce heavy precipitation events. Habets et al. (2004) used precipitation forecasts from two French NWP models as inputs to a hydrologic model. Roy Bhowmik et al. (2007) evaluated the precipitation predictive skill of the Indian Meteorological Department operational NWP system over the Indian monsoon region. Roberts (2008) assessed the spatial and temporal variation in the skill of precipitation forecasts from an NWP model. Roberts et al. (2009) demonstrated the benefit of using high resolution NWP model precipitation forecasts for flood and short-term streamflow forecasting. Ghile and Schulze (2010) verified the skill and accuracy of the precipitation forecasts by three NWP models over the Mgeni catchment in South Africa. Ghelli and Ebert (2008) and Jolliffe and Stephenson (2012) presented a comprehensive review and the state of the art in forecast verification.
This study focuses on a comprehensive analysis of NWP precipitation forecasts in Australia from a hydrological perspective. Unlike many synoptic-scale precipitation verification studies undertaken from a meteorological perspective, this study evaluates the precipitation forecasts on scales relevant to hydrology. The evaluation of precipitation forecasts and other meteorological variables from a hydrological point of view is challenging because the resolution of the NWP model is often too coarse to resolve the small catchment scale. Furthermore, irregular catchment boundaries do not necessarily coincide with NWP model grids. This may require an interpolation of the NWP model precipitation forecasts. The verification of precipitation forecasts from a hydrological perspective also needs to be carried out at short temporal resolution (e.g. sub-daily).
Few studies have verified NWP precipitation forecasts for Australia. McBride and Ebert (2000) verified precipitation forecasts from 7 international (including Australian) NWP models over Australia. They verified 24 h total (daily) precipitation forecasts for the first 24 h of the forecast period over a one-year period using only categorical verification scores. The verification statistics were presented over a standardized 1° latitude-longitude grid over the continent of Australia. Ebert et al. (2003) reported the WGNE assessment of short-term precipitation forecasts from several international global and regional NWP models in different areas of the globe, including Australia. Forecasts of 24 h precipitation totals were verified at lead times of 24 and 48 h over Australia using only two categorical evaluation scores.
In contrast to previous studies, the main contributions of this study are to (i) evaluate the quality of the latest generation of Australian NWP models, (ii) use both continuous and categorical evaluation scores, (iii) analyse the evaluation scores of precipitation forecasts at multiple sub-daily temporal resolutions out to longer forecast lead times, and (iv) investigate the diurnal cycle and the sampling uncertainty of the evaluation scores. The study assesses the skill of precipitation forecasts from the latest Australian NWP models over the Ovens catchment in southeast Australia. Precipitation forecasts are verified against station and catchment average precipitation using a number of evaluation scores at different temporal resolutions.

Description of ACCESS models
The Australian Community Climate and Earth-System Simulator (ACCESS) model suite (BoM, 2010) uses version 6.4 of the Unified Model from the UK Met Office. Key features of the various components and physical parameterisations are given in BoM (2010). The ACCESS APS0 system comprises a global model (ACCESS-G) with an 80 km resolution and forecast duration of 10 days; regional models (ACCESS-R and ACCESS-T) with a 37.5 km resolution and forecast duration of 3 days; an Australian model (ACCESS-A) with a 12 km resolution and forecast duration of 2 days; city models (ACCESS-VT, ACCESS-S, ACCESS-P, ACCESS-BR) with a 5 km resolution and forecast duration of 36 h; and a tropical cyclone model (ACCESS-TC) with a 12 km resolution, a relocatable spatial domain and forecast duration of 3 days. New versions of the ACCESS models (APS1) with improved resolution and model physics are currently being introduced at the BoM.
Figure 1 shows the domains of ACCESS APS0 (ACCESS-G, ACCESS-R, ACCESS-A, and ACCESS-VT) models which are used in this study.
The ACCESS system uses a four-dimensional variational data assimilation scheme which allows observations made at a range of times and locations to be used to initialise the model in a dynamically consistent way. Data assimilation occurs 4 times daily for nominal assimilation base times of 00:00, 06:00, 12:00 and 18:00 UTC. However, for ACCESS-G and ACCESS-VT, full model forecasts are only run at 00:00 and 12:00 UTC. In contrast, for ACCESS-R and ACCESS-A, full model forecasts are run 4 times daily at 00:00, 06:00, 12:00, and 18:00 UTC. For ACCESS-R and ACCESS-A, a second update data assimilation step is run 4 h later than the main run to make use of any additional observational data that were not available at the time of the earlier main assimilation step (BoM, 2010).
This study uses the archive of precipitation forecasts generated in real time by the ACCESS models. This archive began in late 2009 and has been maintained through to the present. Table 1 shows the archive of precipitation forecasts (issued at 12:00 UTC) available for the study. The BoM expects to run the hydrologic models around 09:00 LT (Fig. 2). The most recent ACCESS model forecasts available at 09:00 LT are those initialised at 12:00 UTC (22:00 LT in Victoria). Therefore, the results presented in this study disregard the first 11 h of the NWP forecast. NWP forecasts for the first few hours are generally regarded as not reliable because of the so-called "spin-up" time (Kasahara et al., 1992). Thus our results are considered to be free from model spin-up effects.
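The 11 h figure follows from simple clock arithmetic, which can be sketched as below. This is only an illustration: the function name `hours_discarded` is hypothetical, and local time is assumed to be a fixed UTC+10 offset, consistent with the "12:00 UTC = 22:00 LT" conversion quoted above (daylight saving is ignored).

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time (LT) is a fixed UTC+10 offset, as implied by the
# "12:00 UTC (22:00 LT in Victoria)" conversion in the text.
LT = timezone(timedelta(hours=10))


def hours_discarded(init_utc: datetime, run_hour_lt: int = 9) -> int:
    """Hours between NWP initialisation and the next hydrological model run."""
    init_lt = init_utc.astimezone(LT)
    run_lt = init_lt.replace(hour=run_hour_lt, minute=0, second=0, microsecond=0)
    if run_lt <= init_lt:
        run_lt += timedelta(days=1)  # the run takes place the following morning
    return int((run_lt - init_lt).total_seconds() // 3600)


# Forecast initialised at 12:00 UTC, hydrological model run at 09:00 LT.
init = datetime(2010, 3, 31, 12, 0, tzinfo=timezone.utc)
print(hours_discarded(init))  # 11
```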
Precipitation forecasts from all models are available at hourly intervals for this study, with the exception of ACCESS-G, for which forecasts are available at 3-hourly intervals. In order to compare the skill among the models, only a one-year period of data from 31 March 2010 to 30 March 2011 (see Table 1) is selected for the analysis.

Study area
In this study, the Ovens catchment in southeast Australia is selected to evaluate the skill of the precipitation forecasts from ACCESS models (Fig. 3). The Ovens catchment is the focus of a prototype flood and short-term streamflow forecasting service with lead times up to 10 days run by the BoM. The Ovens catchment provides a significant source of unregulated inflow to the Murray Darling Basin and has several urban centres that have experienced significant economic damage from flooding.
The Ovens river rises in the Victorian Alps and the catchment is bounded by several significant peaks, including Mount Hotham (elevation 1861 m, longitude 147.33
Figure 3 also shows the spatial resolution of the ACCESS models with respect to the resolution of the hydrological model (93 sub-catchment areas) currently used in operational streamflow forecasting. The figure shows that the hydrological model resolution is roughly comparable to the 12 km ACCESS-A model grid. Furthermore, the coarser NWP models (viz. ACCESS-R and ACCESS-G) are unlikely to capture gradients of precipitation across the catchment. The 80 km resolution ACCESS-G model has only 4 grid cells across the catchment and more than three-quarters of the catchment is covered by a single grid cell.
Observed precipitation data were collected from 33 measurement stations that are used for operational forecasting in the Ovens catchment (Table 2). The measurement stations are reasonably well distributed across the catchment and surroundings, as shown in Fig. 3. Careful preparation of the precipitation observations was necessary and included removal of suspicious data and infilling of missing values. The infilling process related daily precipitation totals at the measurement stations to gridded daily precipitation data from the Australian Water Availability Project (Jones et al., 2009) and disaggregated the daily total using the concurrent temporal pattern from the nearest available station. A manual/visual quality control system was also used to identify and replace outliers.
Catchment average precipitation was estimated as the area-weighted average of sub-catchment precipitation. Sub-catchment precipitation data were derived by inverse distance weighting of precipitation from the nearby stations. The station precipitation time series were made serially complete before inverse distance weighting to sub-catchment centroids. The sub-catchment precipitation was used to drive the hydrological model, whereas catchment average precipitation was used to evaluate precipitation forecasts from NWP models at the catchment scale. The spatial resolution of the global model is too coarse to carry out the evaluation at sub-catchment scale.
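The inverse distance weighting step can be illustrated with a minimal sketch. The text does not state the distance exponent or how many nearby stations were used, so the exponent of 2, the use of all supplied stations, and the function name `idw` are assumptions made here for illustration only.

```python
import numpy as np


def idw(station_xy, station_p, target_xy, power=2.0):
    """Inverse-distance-weighted precipitation at one target point.

    Assumptions (not stated in the text): weights are 1 / distance**power
    with power = 2, and all supplied stations contribute.
    """
    xy = np.asarray(station_xy, dtype=float)
    p = np.asarray(station_p, dtype=float)
    d = np.linalg.norm(xy - np.asarray(target_xy, dtype=float), axis=1)
    if np.any(d == 0):
        return float(p[d == 0][0])  # target coincides with a station
    w = 1.0 / d**power
    return float(np.sum(w * p) / np.sum(w))


# Two hypothetical stations equally distant from a sub-catchment centroid.
print(idw([(0.0, 0.0), (2.0, 0.0)], [2.0, 6.0], (1.0, 0.0)))  # 4.0
```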

Evaluation methods
The skill of NWP precipitation forecasts is known to vary in space and time. Therefore, an evaluation of the NWP precipitation forecasts should aim to reflect this characteristic. WWRP/WGNE (2008) recommended that evaluation be done against both gridded observations (model-oriented evaluation) and station observations (user-oriented evaluation). Model-oriented evaluation includes processing of observation data to match the spatial and temporal scales of the model. User-oriented evaluation uses station observations to evaluate model output from the overlying model grid cell.
In this study the quality of NWP precipitation forecasts is evaluated against both station and gridded observations. Station-based evaluation is done by directly comparing the station precipitation with the NWP precipitation at the model grid cell in which the station lies. While this method is simplistic, any alternative would involve a spatial interpolation of precipitation data from irregularly spaced measurement stations, which may introduce further bias (Richard et al., 2003). Although this verification approach has deficiencies (Roberts, 2008), direct comparison facilitates the understanding of skill from a user's perspective (i.e. without any interpolation or reanalysis). Furthermore, hydrological models are commonly calibrated with station observations and, therefore, an evaluation of the quality and skill of an NWP model has to be performed using observations (Pappenberger et al., 2008). The evaluation scores (described below) are computed for all 33 measurement stations over the Ovens catchment individually by averaging over a period of one year (time averaging).
Evaluation using gridded observations is done at catchment scale, where the grid is defined by an irregular catchment boundary rather than the NWP model grid. Evaluation is done by comparing the interpolated catchment average precipitation with the corresponding NWP precipitation forecast. The catchment average precipitation forecast $F_c$ is computed by weighting each precipitation forecast $F_i$ at grid cell $i$ by the fraction of the catchment area within grid cell $i$, and is given by:
$$F_c = \frac{\sum_{i=1}^{N_g} A_i F_i}{\sum_{i=1}^{N_g} A_i} \qquad (1)$$
where $A_i$ is the area of the catchment within grid cell $i$ and $N_g$ is the number of grid cells covered partly or fully by the catchment.
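As a small illustration of the area weighting described above, the sketch below computes a catchment average forecast from per-cell forecasts and overlap areas. The function name `catchment_forecast` and all numbers are hypothetical; the overlap areas are chosen only to mimic a coarse grid in which one cell covers more than three-quarters of a 5500 km² catchment.

```python
import numpy as np


def catchment_forecast(cell_forecast, cell_area_in_catchment):
    """Area-weighted catchment average of grid-cell precipitation forecasts."""
    f = np.asarray(cell_forecast, dtype=float)
    a = np.asarray(cell_area_in_catchment, dtype=float)
    return float(np.sum(a * f) / np.sum(a))


# Hypothetical example: 4 grid cells overlapping the catchment.
areas = [4200.0, 700.0, 400.0, 200.0]   # km^2 of catchment inside each cell
forecasts = [2.0, 5.0, 8.0, 10.0]       # mm per accumulation period
print(round(catchment_forecast(forecasts, areas), 3))  # 3.109
```

Because the weights are overlap areas rather than whole-cell areas, a cell that only clips the catchment boundary contributes little to the average.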
As no single evaluation score is adequate to judge the quality of NWP model precipitation forecasts, a large variety of scores are used operationally to verify them (see e.g. Stanski et al., 1989; Wilks, 2006; Wilson, 2001; WWRP/WGNE, 2008). A detailed assessment of the strengths and weaknesses of a set of forecasts usually requires more than one or two summary scores (Jolliffe and Stephenson, 2012). In this study, forecasts of precipitation amount are evaluated using three commonly used continuous verification scores: root mean square error (RMSE), bias and correlation coefficient.
These scores assess different aspects of forecast quality. RMSE is one of the most basic and widely used methods of verification, and assesses the average magnitude of forecast errors (Stanski et al., 1989). Bias assesses the difference between the mean of the forecasts and the mean of the corresponding observations. The correlation coefficient reflects the linear association between the forecasts and observations. The Pearson product moment correlation coefficient is not sensitive to biases that may be present in the forecasts; it is, however, sensitive to outliers (Wilks, 2006). Thus the Spearman rank correlation coefficient is more appropriate than the Pearson correlation when data are not normally distributed. Note that the above three evaluation scores are related according to the following equation (Murphy, 1988):
$$\mathrm{RMSE}^2 = \mathrm{Bias}^2 + S_f^2 + S_o^2 - 2\, S_f S_o\, \mathrm{Corr} \qquad (2)$$
where $S_f^2$ and $S_o^2$ are the sample variances of the forecasts and observations, respectively, and Corr is the Pearson correlation between the forecasts and observations.
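The relationship between the three continuous scores can be checked numerically. The following sketch uses synthetic gamma-distributed data (purely illustrative, not the study data), takes bias as mean forecast minus mean observation (an assumed sign convention), and uses 1/n variances, for which the decomposition of Murphy (1988) holds exactly.

```python
import numpy as np


def continuous_scores(f, o):
    """Return (RMSE, bias, Pearson correlation) for forecasts f and observations o."""
    f, o = np.asarray(f, dtype=float), np.asarray(o, dtype=float)
    rmse = np.sqrt(np.mean((f - o) ** 2))
    bias = f.mean() - o.mean()        # assumed convention: forecast minus observation
    corr = np.corrcoef(f, o)[0, 1]    # Pearson product moment correlation
    return rmse, bias, corr


# Synthetic, skewed "precipitation-like" data for illustration only.
rng = np.random.default_rng(0)
o = rng.gamma(0.5, 4.0, size=365)
f = 0.7 * o + rng.gamma(0.5, 2.0, size=365)

rmse, bias, corr = continuous_scores(f, o)
s_f, s_o = f.std(), o.std()           # 1/n standard deviations
rhs = bias**2 + s_f**2 + s_o**2 - 2.0 * s_f * s_o * corr
assert np.isclose(rmse**2, rhs)       # decomposition of the mean square error
```

The decomposition makes explicit that, when the bias is small relative to the standard deviations, removing it changes the RMSE only slightly.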
From a user's point of view it is also important to know whether rain occurs or not. Continuous precipitation values can be viewed categorically (or as binary "yes" or "no" events) according to whether or not the rain exceeds a given threshold value.
Categorical verification scores are then used to evaluate the occurrence of precipitation. Categorical verification scores are less sensitive to large errors than continuous verification scores (especially those involving squared errors), which is particularly relevant for highly skewed data such as precipitation amounts. Thus categorical verification scores may give more meaningful information for precipitation verification (WWRP/WGNE, 2008).
A number of the categorical verification scores are computed by building a contingency table (Table 3), which shows the joint distribution of observed and forecast events and non-events. In Table 3, "Hits" represents the number of events for which both forecasts and observations exceed a given threshold, "Misses" represents the number of events for which only observations exceed the threshold, "False Alarms" represents the number of events for which only forecasts exceed the threshold and "Correct Negatives" represents the number of events for which neither forecasts nor observations exceed the threshold.
In this study, probability of detection (POD), false alarm ratio (FAR), frequency bias (FBI) and critical success index (CSI) have been calculated from the contingency table. Table 4 shows the formulae of the categorical verification scores with their possible ranges and perfect values. POD measures the fraction of observed events that were correctly forecast and is insensitive to false alarms. FAR gives the fraction of forecast events that were observed to be non-events and ignores the misses. FBI gives the ratio of the frequency of forecast rain to that of observed rain and does not take accuracy into account. CSI gives the fraction of all forecast and observed events that were correctly diagnosed and considers both misses and false alarms.
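A minimal sketch of how these four scores follow from the contingency table counts is given below. It uses the standard definitions, which are assumed to match Table 4; the function name `categorical_scores` and the example values are hypothetical, and the sketch does not guard against the case where no events are observed or forecast.

```python
import numpy as np


def categorical_scores(f, o, threshold=0.1):
    """POD, FAR, FBI and CSI for events defined as values exceeding `threshold`."""
    f_yes = np.asarray(f, dtype=float) > threshold
    o_yes = np.asarray(o, dtype=float) > threshold
    hits = int(np.sum(f_yes & o_yes))
    misses = int(np.sum(~f_yes & o_yes))
    false_alarms = int(np.sum(f_yes & ~o_yes))
    # Correct negatives are not needed for these four scores.
    return {
        "POD": hits / (hits + misses),
        "FAR": false_alarms / (hits + false_alarms),
        "FBI": (hits + false_alarms) / (hits + misses),
        "CSI": hits / (hits + misses + false_alarms),
    }


# Hypothetical forecasts and observations (mm per accumulation period):
# 2 hits, 1 miss, 2 false alarms, 1 correct negative.
scores = categorical_scores([0, 1, 2, 0, 3, 0.5], [0, 2, 0, 1, 4, 0])
print(scores["FAR"], scores["CSI"])  # 0.5 0.4
```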
The value of any evaluation score is limited if the uncertainty associated with the score is not quantified (Jolliffe, 2007). Any evaluation score must be regarded as a sample estimate of the "true" value for an infinitely large verification dataset. There is therefore some uncertainty associated with the score's value, especially when the sample size is small or the data are not independent, or both (WWRP/WGNE, 2008). In this study the uncertainty associated with the evaluation scores is estimated using a re-sampling technique. Although it is also possible to compute the uncertainty of some scores theoretically by assuming a distribution (e.g. a Gaussian distribution for correlation coefficients), the distribution of other scores cannot be modelled exactly or approximated by theoretical distributions. Thus we have used re-sampling techniques to generate an empirical distribution of the values of the evaluation scores, from which the sampling uncertainty is computed. A bootstrap procedure (Efron and Tibshirani, 1993) is used to analyse the sampling uncertainty, which addresses the question of what range of scores would be obtained given different sets of forecasts from the same forecast system.
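One possible form of such a bootstrap is sketched below. It resamples forecast-observation pairs with replacement and takes percentile limits of the resulting scores; the pairwise resampling, the percentile interval, and the names `bootstrap_ci` and `rmse` are assumptions for illustration, since the text does not specify these details. Simple pair resampling also treats the pairs as independent, which may not hold for serially correlated precipitation.

```python
import numpy as np


def rmse(f, o):
    return float(np.sqrt(np.mean((f - o) ** 2)))


def bootstrap_ci(f, o, score, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for score(f, o), resampling pairs."""
    f, o = np.asarray(f, dtype=float), np.asarray(o, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(f)
    # Each row holds the indices of one resampled set of n pairs.
    idx = rng.integers(0, n, size=(n_boot, n))
    samples = np.array([score(f[i], o[i]) for i in idx])
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(samples, [alpha, 1.0 - alpha])
    return float(lo), float(hi)


# Synthetic data for illustration only.
rng = np.random.default_rng(1)
obs = rng.gamma(0.5, 4.0, size=365)
fc = 0.7 * obs + rng.gamma(0.5, 2.0, size=365)
print(bootstrap_ci(fc, obs, rmse))  # lower and upper 95 % limits of the RMSE
```

Any score that takes forecasts and observations as arguments can be passed as `score`, so the same routine serves continuous and categorical scores alike.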
In this study, we present absolute evaluation scores rather than scores relative to some reference (e.g. climatology, persistence, etc.). This allows for direct comparison of the precipitation forecasts from NWP models of different spatial resolutions between many stations and at different temporal resolutions. Thus, the term "skill" means absolute evaluation score in this study.

Results
The first step of the evaluation is to compare the NWP forecasts to the observations at the point scale (rain gauge stations). Although this eliminates any possible errors due to the spatial interpolation of the station data, errors due to sub-grid scale variability and representativeness may remain. For example, the frequency of zero precipitation at a grid cell will necessarily be less than at a randomly selected point within that grid cell (because if it rains anywhere, the grid cell precipitation will be non-zero). In Sect. 4.6, we evaluate the skill of NWP model forecasts at the catchment scale.

Forecasts of 1-24 h lead time
Figure 4 shows a map of 24 h mean precipitation accumulation for the measurement stations and for the ACCESS model grid cells over the Ovens catchment.
The ACCESS-A model places the highest precipitation over Mount Feathertop (southeast, near station 25) and east of Wabonga in the southwest interior of the catchment (near stations 16, 20, 21, and 23). The observed precipitation at station 25 is 6.3 mm day⁻¹ and the corresponding ACCESS-A forecast is 6.44 mm day⁻¹. Purely based on elevation, one would not necessarily expect a maximum in long-term average precipitation in this ungauged area, although it may be due to the precipitation patterns in this particular year. Like the ACCESS-VT, the ACCESS-A model has a tendency to overforecast in lowland areas and underforecast in highland areas.
The ACCESS-R model has a precipitation minimum in the headwaters of the catchment (east of station 33 and near stations 25, 26, 30, and 32). These are some of the wettest areas for the high resolution models. Also, the cluster of stations in the southwest part of the catchment has a range of averages that is wide enough to suggest that there is significant within-grid cell variability at this scale. The precipitation maximum is in the northeast corner of the catchment which is a dry region in the
Figure 5 shows the evaluation scores of the ACCESS models for forecasts of 24 h precipitation accumulations at measurement stations. The RMSE score standardised by the standard deviation of the observations is shown in Fig. 5a. The RMSE value will be greater than 1 when the mean square error (MSE) exceeds the variance of the observations. This is analogous to a negative value of the Nash-Sutcliffe efficiency (Nash and Sutcliffe, 1970), which also occurs when the MSE exceeds the variance of the observations. For most of the stations the RMSE score is less than 1, which indicates that the ACCESS model forecasts are more informative than the average of the observations. ACCESS-VT model
This finding supports the hypothesis that orographically enhanced precipitation is underestimated by NWP models. The ACCESS-R model also shows a similar pattern, but the bias is much greater than that of the high resolution models. The coarse resolution model ACCESS-G has a systematic positive bias (underforecasting) for all stations, and the bias generally increases with latitude and altitude. The bias of the coarse resolution NWP model is up to 70 %. Other studies have reported NWP biases on the order of 100 % (see e.g. Clark and Hay, 2004). As the model resolution becomes progressively coarser (i.e. regional and global models), large systematic biases emerge.
Unlike the RMSE score, the bias shows some spatial pattern. Spearman rank correlation coefficients between the station precipitation and the corresponding ACCESS model forecasts are shown in Fig. 5c. The correlation coefficients of the high resolution models ACCESS-VT and ACCESS-A and the regional model ACCESS-R are comparable and vary between about 0.7 and 0.8. At some stations, like Rosenwhite (10), all these models give consistently lower correlation values (about 0.7), and at stations like Cheshnut (23), all models give consistently higher correlation values (about 0.8). The correlations between station precipitation and ACCESS-G forecasts are generally lower than those of the higher resolution models, and vary across the stations. This may be due mainly to two reasons: (i) the ACCESS-G model resolution is coarse and the spatial variability of precipitation across the stations within a model grid cell is high; and (ii) the spatial variability of the forecast precipitation across the model grid cells is small. Thus the spatial variability of correlation coefficients (Fig. 5c) comes mainly from the variability of the observed precipitation across the stations and not necessarily from the ACCESS-G forecasts. For example, Wangaratta AWS (4 at Mongans Bridge (11) may be due to a forecast and/or observation outlier in March 2011 (forecast of 150 mm against observation of 5 mm precipitation). Further analysis has been done to understand the contribution of bias and variance to RMSE (see Eq. 2). The variances of the forecasts and observations are of the same order of magnitude. However, the biases of the precipitation forecasts from ACCESS models are much smaller than the standard deviations of the forecasts and observations and, therefore, reducing the biases of the forecasts may not necessarily reduce the RMSE significantly.

Variation of evaluation scores with forecast lead times
NWP model skill varies with time for three main reasons: the quality of the initial analysis, baroclinic and/or barotropic instability of the large scale flow, and model systematic errors (Stanski et al., 1989). Accurate forecasts at the start of a model run do not necessarily remain accurate, and vice versa. Even at times when the models have more skill overall, there can still be some hours where the forecasts are significantly less skilful (Roux and Seed, 2011). In this section we examine the skill of the NWP model forecasts at different lead times. We present an analysis of forecasts from the ACCESS-G model because it has the longest lead time. We focus on a single precipitation station, Carboor Upper (13), which is close to the centre of the Ovens catchment and of an ACCESS-G model grid cell, and analyse the forecast skill of 3 h precipitation accumulations. Analysis of other models and locations produces similar results. The score for 3 h precipitation accumulations at 3 h lead time means the score of total precipitation for the period 09:00-12:00 LT.
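The lead-time convention used here, in which a window is labelled by the lead time at its end, can be sketched as follows. The helper `accumulate` is hypothetical, and the hourly values are assumed to start at 09:00 LT, i.e. after the first 11 h of the forecast have been discarded.

```python
import numpy as np


def accumulate(hourly, window):
    """Sum an hourly series into consecutive non-overlapping windows."""
    hourly = np.asarray(hourly, dtype=float)
    n = (len(hourly) // window) * window   # drop an incomplete final window
    return hourly[:n].reshape(-1, window).sum(axis=1)


# Hypothetical hourly forecast amounts for the 12 h following 09:00 LT.
hourly = np.arange(1, 13)
acc3 = accumulate(hourly, 3)
lead_h = 3 * np.arange(1, len(acc3) + 1)   # lead time at the end of each window

for lead, total in zip(lead_h, acc3):
    print(f"lead {lead:2d} h: {total:.0f} mm")
# lead  3 h covers 09:00-12:00 LT, lead  6 h covers 12:00-15:00 LT, and so on.
```

The same function with `window=24` would give the 24 h accumulations used for the station maps.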
Figure 6 shows the forecast skill of 3 h precipitation accumulations for the ACCESS-G model at Carboor Upper station. The RMSE score (Fig. 6a) displays considerable variation with lead time. The RMSE score is below 1 for lead times up to 39 h and subsequently fluctuates around 1. Figure 6b shows that the forecast bias varies significantly at different lead times and shows some diurnal cycle. Further investigation into the diurnal cycle is presented in Sect. 4.5. The forecasts have a bias of up to 75 % and consistently underestimate 3 h precipitation accumulations for most lead times.
Figure 6c shows the Spearman rank correlation coefficient between forecast and observed 3 h precipitation accumulations. One can see that the correlation coefficient decreases with lead time, which is not obvious in the other two scores mentioned before. The correlation coefficient starts with a value of about 0.6 at the shortest lead time and decreases to a value of about 0.1 at the longest lead time.
Figure 6 also shows the 95 % confidence intervals of sampling uncertainty for the evaluation scores, obtained using 10 000 bootstrap samples. Although this number seems somewhat arbitrary, an analysis of the convergence of the mean of the evaluation scores (results not shown) suggests that it is sufficient. The top panel shows that the RMSE score has a considerable sampling uncertainty (light shaded area) which varies at different lead times. Particularly at 39 and 138 h, the uncertainty is very large, indicating that some extreme events strongly influence the RMSE score. Further analysis of forecasts at these lead times shows that, on the one hand, the model is not able to forecast some extreme events, but on the other hand, the model is producing very large forecasts for some small events.
Figure 6b illustrates the sampling uncertainty in the bias score. Like the RMSE, this score also reveals that there is a considerable sampling uncertainty; particularly at 42, 75, 123 h, and some other forecast hours, the uncertainty of the bias score is very large. The 95 % confidence intervals of sampling uncertainty associated with the Spearman rank correlation coefficient are presented in Fig. 6c, and seem to be more symmetrical than for the other scores. They are consistent with the correlation coefficients between precipitation forecasts and the corresponding observations and do not fluctuate like those of the other scores, as the Spearman correlation is less sensitive to extreme values.
Figure 7 shows the categorical evaluation scores and their 95 % confidence intervals as a function of forecast lead time. In this study a threshold value of 0.1 mm (3 h)⁻¹ is used to define the precipitation event as "yes" or "no". A non-zero threshold is imposed because there is a minimum measurable precipitation amount for the operational tipping bucket rain gauges. Figure 7a shows the POD scores of the model forecasts. As expected, the score decreases with increasing lead time. For example, at the shortest lead time more than 70 % of the observed events are correctly detected, while at the longest lead times, the POD score reduces to 30 %. As with the continuous scores, the sampling uncertainty is quite large. Figure 7b shows that the FAR score increases with lead time, which is consistent with the POD score. The FAR score increases from a value of about 0.5 at the shortest lead time to 0.75 at the longest lead time. As far as the uncertainty results are concerned, the FAR score behaves similarly to the POD. The equivalent diagram for the FBI as a function of forecast lead time is shown in Fig. 7c. Unlike the POD and FAR scores, the FBI score shows no trend with lead time; rather, it fluctuates around a value of 1.3 and shows evidence of a diurnal cycle. Comparing the continuous bias (Fig. 6b) and the frequency bias (Fig. 7c) of the forecasts produces an interesting result: forecasts of the precipitation amount tend to be too low, but the occurrence of precipitation is overestimated for most forecast lead times. This indicates that the model forecasts small amounts of precipitation too frequently. This is a well known behaviour of many NWP models and has been reported elsewhere.
The FBI score also shows considerable sampling uncertainty.
The CSI score reported in Fig. 7d displays results similar to the POD and the FAR scores. The CSI is similar to the POD except that it also accounts for false alarms. If there are no false alarms, the two scores are equal; otherwise the CSI is smaller than the POD. For the ACCESS-G model forecasts, the score varies from about 0.45 at the shortest lead time to about 0.15 at the longest lead time. The uncertainty results are likewise similar to those for the POD score, although the variation across lead times is smaller.
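The relationships among the categorical scores described above can be sketched as follows. The sketch assumes the standard contingency-table definitions of POD, FAR, FBI and CSI and treats an amount at or above the threshold as a "yes" event; the function names and the sample values are hypothetical.

```python
def contingency_counts(fcst, obs, threshold=0.1):
    """Count hits, misses, false alarms and correct negatives.

    Assumption: an event is "yes" when the amount is >= threshold.
    """
    hits = misses = false_alarms = correct_negatives = 0
    for f, o in zip(fcst, obs):
        f_yes, o_yes = f >= threshold, o >= threshold
        if f_yes and o_yes:
            hits += 1
        elif o_yes:
            misses += 1
        elif f_yes:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def categorical_scores(hits, misses, false_alarms):
    """Return POD, FAR, FBI and CSI from contingency counts."""
    return {
        "POD": hits / (hits + misses),
        "FAR": false_alarms / (hits + false_alarms),
        "FBI": (hits + false_alarms) / (hits + misses),
        "CSI": hits / (hits + misses + false_alarms),
    }


# Hypothetical 3 h accumulations (mm).
obs = [0.0, 0.4, 0.0, 0.2, 1.0, 0.0]
fcst = [0.0, 0.5, 1.0, 0.0, 2.0, 0.3]
hits, misses, false_alarms, _ = contingency_counts(fcst, obs)
scores = categorical_scores(hits, misses, false_alarms)
print(scores)
```

Because the CSI denominator adds the false alarms to the POD denominator, the CSI can never exceed the POD, and the two coincide when there are no false alarms.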
The categorical skill of the reference forecasts is also shown in Fig. 7. The reference forecasts are generated using a permutation procedure (see e.g. Mason, 2008; Deque, 2012). The permutation procedure generates a new set of forecast-observation pairs in which the observations are unrelated to the forecasts except by chance. This procedure addresses the question of how likely it is that a given value of an evaluation score could have been obtained by accident. The mean scores of 10 000 such reference forecasts are shown as dashed lines. Note that the FBI of the reference forecasts is the same as that of the ACCESS forecasts; hence it is not shown in the figure. The results show that, given the sampling uncertainty, the ACCESS-G model might not have significant skill beyond 7 days.
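A minimal sketch of such a permutation procedure is given below. It assumes that the observations are shuffled relative to the forecasts and that the score is averaged over the shuffles; the function names and the yes/no sample series are hypothetical. The sketch also illustrates why the FBI of the reference forecasts equals that of the original forecasts: the FBI depends only on the numbers of forecast and observed events, not on how they are paired.

```python
import random


def event_counts(fcst_yes, obs_yes):
    """Return (hits, misses, false alarms) for paired yes/no events."""
    hits = sum(1 for f, o in zip(fcst_yes, obs_yes) if f and o)
    misses = sum(1 for f, o in zip(fcst_yes, obs_yes) if o and not f)
    false_alarms = sum(1 for f, o in zip(fcst_yes, obs_yes) if f and not o)
    return hits, misses, false_alarms


def csi(fcst_yes, obs_yes):
    hits, misses, false_alarms = event_counts(fcst_yes, obs_yes)
    return hits / (hits + misses + false_alarms)


def fbi(fcst_yes, obs_yes):
    hits, misses, false_alarms = event_counts(fcst_yes, obs_yes)
    return (hits + false_alarms) / (hits + misses)


def reference_mean(fcst_yes, obs_yes, score, n_perm=10_000, seed=0):
    """Mean score when observations are randomly re-paired with forecasts."""
    rng = random.Random(seed)
    shuffled = list(obs_yes)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real forecast-observation link
        total += score(fcst_yes, shuffled)
    return total / n_perm


# Hypothetical event series: True means precipitation at or above the threshold.
fcst_yes = [True, True, False, True, False, False, True, False]
obs_yes = [True, False, False, True, True, False, True, False]

ref_csi = reference_mean(fcst_yes, obs_yes, csi, n_perm=2000)
ref_fbi = reference_mean(fcst_yes, obs_yes, fbi, n_perm=2000)
print(f"CSI = {csi(fcst_yes, obs_yes):.2f}, reference CSI = {ref_csi:.2f}")
print(f"FBI = {fbi(fcst_yes, obs_yes):.2f}, reference FBI = {ref_fbi:.2f}")
```

A forecast score that lies well above the reference mean is unlikely to have arisen by chance pairing alone.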

Variation of evaluation scores with precipitation accumulation periods
An analysis of evaluation scores of the ACCESS models (except ACCESS-G) indicates that the skill of the hourly precipitation forecasts is very low and varies significantly from hour to hour (results not shown). However, there is some skill for forecasts of 3 h precipitation accumulations. Forecast skill increases with temporal accumulation because errors in the timing of precipitation decrease. In this section we further analyse the scores of forecasts from the ACCESS-G model for different accumulation periods (Fig. 8). The RMSE score is highest (about 1.48) at 136 h lead time for 3 h precipitation accumulations (Fig. 8a). This drops to 1.44, 1.41 and 1.22 for 6, 12 and 24 h precipitation accumulations, respectively. At lead times between 36 and 72 h, the RMSE skill increases (i.e. the RMSE score decreases) significantly from shorter accumulation periods to longer ones. Further analysis of sampling uncertainty (not shown) supports the finding that skill at the 24 h accumulation period is significantly better than skill at the 3 h accumulation period at shorter lead times. For the longer lead times, the skill does not differ significantly between accumulation periods.
Figure 8b shows that the maximum bias of −75 % is reduced to −46 % when the accumulation period increases from 3 to 24 h at lead times between 144 and 168 h. The bias of forecasts of 24 h precipitation accumulations decreases from −38 % at 1 day to −26 % at 3 days lead time and then increases to −54 % at the longest lead time. The model overestimates 3 h precipitation accumulations at some lead times (e.g. 51, 75, 99 and 123 h). For the corresponding periods, the biases of the 24 h precipitation accumulations are negative (underestimation) because the biases of the other 3 h accumulations within these periods are negative and the net effect is negative.
Figure 8c shows the Spearman correlation coefficients between forecast and observed precipitation as a function of lead time and accumulation period. The Spearman correlation coefficient displays less variation than the other two scores because it is less sensitive to outliers and extreme events. At the shortest lead time, the correlation increases from 0.52 for the shortest accumulation period (3 h) to 0.74 for the longest accumulation period (24 h). The plot of correlation coefficients for 24 h precipitation accumulations exhibits a smooth monotonic decay and appears to be less affected by sampling fluctuations.
The analysis presented in this section suggests that, in general, the skill of ACCESS-G precipitation forecasts increases with increasing accumulation period. However, the appropriate accumulation period will depend not only on the forecast skill but also on the intended use of the NWP precipitation forecasts. For example, for flood forecasting applications, daily forecasts are likely to be too coarse because the flood peak may last only a few hours. For other purposes such as water resources management, hourly precipitation forecasts may not be needed. Further analysis is required to select the optimal temporal resolution for streamflow forecasting purposes.
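The effect of accumulation on timing errors can be illustrated with a small sketch. It assumes that longer accumulations are formed by summing consecutive, non-overlapping 3 h totals; the function names and the sample series, in which the forecast places the rain one 3 h step too late, are hypothetical.

```python
import math


def accumulate(series, steps):
    """Sum consecutive, non-overlapping blocks of `steps` values.

    An incomplete trailing block is dropped.
    """
    n_blocks = len(series) // steps
    return [sum(series[i * steps:(i + 1) * steps]) for i in range(n_blocks)]


def rmse(fcst, obs):
    """Root mean square error of paired forecasts and observations."""
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(fcst, obs)) / len(obs))


# Hypothetical day of 3 h totals (mm): the forecast is one step late.
obs_3h = [0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0]
fcst_3h = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0]

rmse_3h = rmse(fcst_3h, obs_3h)
rmse_24h = rmse(accumulate(fcst_3h, 8), accumulate(obs_3h, 8))
print(f"3 h RMSE = {rmse_3h:.2f} mm, 24 h RMSE = {rmse_24h:.2f} mm")
```

In this constructed case the timing error is penalised twice at 3 h resolution (a miss and a false alarm) but vanishes in the 24 h total, because the forecast volume for the day is correct.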

Variation of evaluation scores with precipitation threshold values
In Sect. 4.2, we presented the categorical evaluation scores of 3 h total precipitation forecasts from the ACCESS-G model for a threshold value of 0.1 mm (3 h)⁻¹. The skill of the NWP precipitation forecasts may also be expected to vary with precipitation intensity.
Figure 9 depicts the categorical evaluation scores of the ACCESS-G forecasts as a function of precipitation threshold value. The scores are computed for forecasts of 24 h precipitation accumulations for lead times of 1 to 9 days. The categorical evaluation scores are strongly related to the threshold and, in general, decrease with increasing threshold values. For example, the POD score (Fig. 9a) decreases from about 0.8 for a low threshold value (1 mm day⁻¹) to about 0.35 for rain amounts above 20 mm day⁻¹ for forecasts of the first 24 h. Furthermore, as expected, the POD score decreases with increasing lead time. The scores for the high threshold values must be used with care because only a few cases may occur: for example, 11.9 % of all cases exceed a threshold of 10 mm day⁻¹ and 7.2 % exceed a threshold of 20 mm day⁻¹.
The remaining panels show the FAR (Fig. 9b), the FBI (Fig. 9c) and the CSI (Fig. 9d) scores. Consistent with the POD score, the FBI and the CSI decrease with increasing threshold values, whereas the FAR score increases with increasing threshold values. At 1 and 2 days lead time, the FAR score decreases, whereas at 3 to 6 days lead time it first decreases for low threshold values and then increases for higher values. Figure 9c shows that, for a low threshold value (e.g. 0.1 mm day⁻¹), the FBI score is greater than 1 at all lead times, whereas for higher threshold values it is less than 1. This indicates that the occurrence of rain or light rain is overestimated while heavy rain events are consistently underestimated. The CSI score increases slightly at low threshold values of 1 and 2 mm day⁻¹ at shorter lead times (1 to 3 days) and then decreases, which is consistent with the FAR score.
A one-sample t-test indicates that the evaluation scores for higher precipitation threshold values are significantly different (at the 5 % significance level) from those for lower threshold values at all lead times. All evaluation scores except FAR for lower threshold values are significantly different for longer lead times. Further analysis shows that all evaluation scores except FBI for longer lead times (days 8 and 9) are significantly different for all precipitation threshold values. FAR and CSI scores for shorter lead times (days 1 and 2) are significantly different for all precipitation threshold values. Note that the sample sizes for the significance tests of evaluation scores at different precipitation thresholds and forecast lead times are 6 and 9, respectively.

Further results
Results from Fig. 6 indicate that there might be a diurnal cycle in the evaluation scores, particularly for the bias. We investigate the diurnal cycle of the observed precipitation and the corresponding ACCESS-R model forecasts at Carboor Upper station. ACCESS-R is chosen for this analysis because ACCESS-G precipitation forecasts are not available at hourly temporal resolution and a more thorough analysis of the diurnal cycle would require a forecast length beyond 24 h. Figure 10 shows the diurnal cycle of observed precipitation at the station and the ACCESS-R forecasts for the corresponding model grid cell. Observed precipitation displays a diurnal cycle, with a maximum at 07:00 UTC, followed by 10:00 and 11:00 UTC, and a minimum at 01:00 UTC. This finding for Carboor Upper station is consistent with results reported by Westra and Sharma (2010), who found that the hourly maximum and minimum in precipitation occurrence fell between 08:00 and 10:00 UTC and between 23:00 and 24:00 UTC, respectively, for more than 80 % of Australian stations. The precipitation forecasts do not seem to have a diurnal cycle, except for an outlier at 13:00 UTC, which is the first hour of the forecast. Poor representation of the timing and magnitude of diurnal cycles, particularly in precipitation, is a known problem with many NWP models and is commonly related to the representation and parameterisation of convective processes (Kaufmann et al., 2003; Dai and Trenberth, 2004; Evans and Westra, 2012).
Since it is difficult to see the diurnal cycle of the evaluation scores of the ACCESS-R model (because the lead time extends only to 60 h), we have further analysed the diurnal cycle of the bias of the ACCESS-G model. In Fig. 6b, some of the lowest biases occur at 27, 51, 75 and 99 h lead times, which correspond to 12:00 LT. This is consistent with the minimum of observed precipitation, which occurs at 11:00 LT (12:00 LT during daylight saving). Similarly, the maximum bias occurs at 21:00 LT, while the maximum values of observed precipitation occur around 18:00-21:00 LT (daylight saving time). Thus there is some consistency between the timing of the hourly maximum and minimum of the observations and that of the bias score. The cyclic nature of the biases in the ACCESS-G model precipitation forecasts is likely the product of the limited ability of the model to describe the diurnal cycle. Furthermore, given that the precipitation forecasts do not seem to have a pronounced diurnal cycle, the bias score, being linear (as opposed to e.g. the RMSE, which is quadratic), exhibits cyclic patterns similar to those of the observations. Further analysis on synthetic data (not shown) supports the finding that a diurnal cycle in the observations is more likely to be seen in the bias score than in the RMSE score or the correlation coefficient.
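The argument that a linear score inherits the observed diurnal cycle can be illustrated with synthetic data. The sketch below is not the authors' synthetic experiment, which is not described in the text; it assumes an idealised sinusoidal observed cycle peaking at 18:00 LT, a forecast with no diurnal cycle, and a signed bias defined as forecast minus observation. All names and values are hypothetical.

```python
import math

HOURS = 24


def mean_by_hour(series):
    """Average an hourly series by hour of day (index 0 is 00:00 LT)."""
    sums = [0.0] * HOURS
    counts = [0] * HOURS
    for i, value in enumerate(series):
        sums[i % HOURS] += value
        counts[i % HOURS] += 1
    return [s / c for s, c in zip(sums, counts)]


# Synthetic observations (mm/h) with a diurnal peak at 18:00 LT, 30 days.
cycle = [0.2 + 0.1 * math.cos(2 * math.pi * (h - 18) / HOURS) for h in range(HOURS)]
obs = cycle * 30
# Synthetic forecast with the same overall mean but no diurnal cycle.
fcst = [0.2] * len(obs)

obs_mean = mean_by_hour(obs)
# Signed bias by hour of day (forecast minus observation): linear in the error.
bias = [f - o for f, o in zip(mean_by_hour(fcst), obs_mean)]
# RMSE by hour of day: quadratic in the error, so the sign is lost.
rmse = [math.sqrt(m) for m in mean_by_hour([(f - o) ** 2 for f, o in zip(fcst, obs)])]

wettest = max(range(HOURS), key=lambda h: obs_mean[h])
driest = min(range(HOURS), key=lambda h: obs_mean[h])
print(f"wettest hour {wettest}: bias = {bias[wettest]:+.2f}, RMSE = {rmse[wettest]:.2f}")
print(f"driest hour {driest}: bias = {bias[driest]:+.2f}, RMSE = {rmse[driest]:.2f}")
```

In this idealised case the signed bias is a mirror image of the observed cycle, with one minimum and one maximum per day, whereas the RMSE takes the same value at the wettest and driest hours and therefore does not distinguish them.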

Evaluation at catchment scale
Previous sections presented the evaluation scores of the ACCESS model precipitation forecasts at the point scale (i.e. at rain gauge stations). For hydrological applications the localisation of precipitation is important at the catchment scale, so it is useful to evaluate precipitation forecasts on catchment averages (e.g. Oberto et al., 2006; Rossa et al., 2008). The catchment average precipitation is used as the input to lumped hydrological models when forecasting streamflow. Furthermore, the sensitivity of the streamflow forecasts to errors in catchment average precipitation is higher than to errors in station precipitation because of a smoothing effect. Figure 11 gives the performance scores for forecasts of 3 h accumulations of catchment average precipitation. These results smooth over some of the errors related to displacement and are a better indicator of the quality of forecasts of precipitation volume. Compared to station precipitation (Fig. 6a), the RMSE score (Fig. 11a) of the catchment average precipitation exhibits a similar pattern, but the magnitude of the score is lower. As expected, the RMSE score of the ACCESS-G model, in general, increases with increasing lead time (e.g. from 0.8 at 3 h lead time to about 1.0 at the longest lead time). The 95 % sampling uncertainty plot shows that there is considerable sampling variation in the RMSE score. For several lead times (e.g. at 42 and 180 h), the uncertainty is very large, indicating that some extreme events strongly influence the RMSE score.
Figure 11b depicts the bias score of catchment average precipitation forecasts from the ACCESS-G model. Systematic biases in the forecasts are evident: the model struggles to produce high enough intensity forecasts. The bias of the ACCESS-G model forecasts is around −48 % at the shortest lead time, then fluctuates around −50 %, and finally reaches about −67 % at the longest lead time. The ACCESS-G model forecasts tend to be lower than the interpolated catchment average precipitation. As with station precipitation, a diurnal cycle is present in the bias score of catchment average precipitation. The uncertainty analysis shows that there is also considerable sampling uncertainty in the bias score, which is not surprising given that only one year of data is used and that it contains some extreme precipitation events. Figure 11c shows the Spearman correlation coefficient between observed catchment average rainfall and ACCESS-G forecasts of 3 h totals. Unlike the other two scores, the correlation exhibits a relatively smooth decay as the lead time increases. The correlation coefficient declines from about 0.7 at the shortest lead time to about 0.22 at the longest lead time. The 95 % sampling uncertainty behaves similarly to that of the station precipitation (Fig. 6c).
Figure 12 shows the categorical evaluation scores for catchment average precipitation forecasts. In general, these scores exhibit patterns similar to those for station precipitation. However, as expected, the skill of the catchment average precipitation forecasts is higher. The POD of the ACCESS-G model forecasts decreases from about 0.68 at the shortest lead time to about 0.38 at the longest lead time (Fig. 12a). Similarly, the FAR of the ACCESS-G model forecasts increases from about 0.2 at the shortest lead time to about 0.53 at the longest lead time (Fig. 12b). Figure 12c shows that there is significant variation in the FBI score across lead times and that a diurnal cycle similar to that of station precipitation (Fig. 7c) is present. However, the FBI of the catchment average precipitation forecasts is less than 1 for most lead times, whereas it is greater than 1 for the station precipitation. This difference is logical because if there is precipitation at any measurement station within the catchment, then the catchment average precipitation is non-zero and the probability of observed rain events is higher. The last panel shows that the CSI score, like the other scores, is higher than that of the station precipitation. It decreases from about 0.6 at the shortest lead time to about 0.27 at the longest lead time. The uncertainty analysis of the categorical evaluation scores is also reported in Fig. 12, and the results are consistent with those of the continuous evaluation scores. The mean scores of the reference forecasts are also shown in Fig. 12. Consistent with the station precipitation results, the ACCESS-G model is unlikely to have significant skill beyond 7-8 days.
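The reasoning about the frequency of observed events at the catchment scale can be sketched as follows. The study uses an interpolated catchment average whose method is not described here, so the sketch substitutes a simple (optionally weighted) mean across stations; the function names, the weights argument and the sample values are hypothetical.

```python
def catchment_average(station_series, weights=None):
    """Weighted mean across stations for each time step.

    `station_series` is a list of equally long per-station series.
    Assumption: equal weights unless `weights` are given.
    """
    n = len(station_series)
    if weights is None:
        weights = [1.0 / n] * n
    total = sum(weights)
    return [
        sum(w * values[t] for w, values in zip(weights, station_series)) / total
        for t in range(len(station_series[0]))
    ]


def event_frequency(series, threshold=0.0):
    """Fraction of time steps with precipitation above `threshold`."""
    return sum(1 for v in series if v > threshold) / len(series)


# Hypothetical 3 h totals (mm) at three stations over six time steps.
stations = [
    [0.0, 1.2, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 0.4, 0.0, 0.0, 0.0, 3.1],
]
average = catchment_average(stations)
station_freq = [event_frequency(s) for s in stations]
print("station event frequencies:", station_freq)
print("catchment event frequency:", event_frequency(average))
```

With a zero threshold, the catchment average is wet whenever any station is wet, so observed events are at least as frequent as at any single station. This raises the denominator of the FBI and is consistent with the lower FBI reported at the catchment scale. With a non-zero threshold the effect can be weaker, because averaging also reduces the amounts.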


Discussion
There is a general perception that the variability of NWP model output does not match the observed variability. Specifically, it is thought that NWP models tend to produce small amounts of precipitation too frequently. NWP models are much less successful in their handling of low level stratiform cloud, and generally tend to overestimate light rain (Golding, 2000). The results (not shown) reveal that the ACCESS models tend to produce too many small rain events. For events less than 0.13 mm h⁻¹ the model frequency is greater than that of the observed data. For events between 0.13 and 2 mm h⁻¹, the models do not produce enough events. For events greater than 2 mm h⁻¹, the sample size is too small to draw reliable conclusions. For all precipitation frequencies the ACCESS-G global model does not produce enough intense events. The NWP forecasts and observations are highly skewed and the error does not necessarily appear to be linear in log-transformed space. Specifically, both time series contain many zeros and the relative error can be very large for small precipitation amounts. Any kind of NWP post-processing would need to transform the observations and forecasts so that the variables or residuals are approximately normally distributed. However, undue weight should not be placed on the small precipitation amounts, as these are relatively inconsequential for flood and streamflow forecasting applications.
The NWP models do not appear to be most skilful at their native resolutions (i.e. hourly for individual grid cells). There is greater skill when the NWP model output is averaged over coarser spatial and temporal resolutions (e.g. catchment average, daily average). Further analysis is necessary to determine the optimal resolution for extracting useful information from NWP, as this resolution may depend on the catchment and/or season. However, any techniques for quantifying NWP forecast uncertainty that use only the native resolution data may wrongly conclude that the NWP forecasts contain no skill.
Prior to the commencement of this study, it was anticipated that the NWP precipitation forecasts would have significant and systematic biases that would have to be corrected to make them useful for predicting streamflow. Even if the precipitation forecasts were good at predicting the "true" precipitation (i.e. what actually fell on the catchment), the "measured" precipitation may depend on the mix of available station data.
Operational datasets used for streamflow forecasting contain a subset of the full station network because of the requirement that data be available in real time and have long records. As a result, the geographic characteristics of stations used for operational streamflow forecasting may not be representative of the catchment as a whole (e.g. they may be clustered in valley bottoms). The data for the Ovens catchment used in this study passed through a thorough quality control and infilling process, which produced serially complete hourly data at stations, checked against an independent gridded precipitation dataset. Such processes often cannot be performed in real time, and therefore the observed data used in this study are closer to the true precipitation than the data currently used in the operational system (which does not check for flat-lined sensors and does not infill missing data).
Somewhat surprisingly, when the NWP model resolution is comparable to that of the hydrological model sub-catchments (in this study, about 8 km), the NWP bias is low. The forecasts may not require correction to be useful, and the operational strategy of using the raw NWP output without downscaling or bias correction may be a fair approximation. Of course, this is just one catchment and one year of forecasts, and this should be tested over a longer period and in different regions. Furthermore, the NWP forecasts may match the high quality, infilled observations but not the current real-time data streams.
The skill of the precipitation forecasts from the NWP models at two nearby stations can be quite different because (i) the stations are in the same model grid cell but have different precipitation observations (observed variability), (ii) they are in different model grid cells and thus have different forecasts (forecast variability), or (iii) they are in different model grid cells and have similar forecasts but different precipitation observations (model's inability to resolve the scale). In this study, precipitation forecasts from the NWP models are compared with station and catchment average precipitation, whose spatial resolution differs from that of the model. A better understanding of the quality of forecasts would be gained if the spatial resolution of the model matched that of the observations. It would also be interesting to compare the quality of NWP model forecasts to some reference forecasts such as climatology or persistence. This study has evaluated the precipitation forecasts for conditions where precipitation is principally due to large scale synoptic systems. Large scale synoptic systems tend to be better predicted by NWP models because they tend to evolve relatively slowly and occur on spatial scales that are resolved by the models (Roux and Seed, 2011; Roux et al., 2012). NWP models tend not to predict precipitation from convective systems well because these processes evolve rapidly and commonly occur on spatial scales finer than those resolved by the model. Further work is planned to extend the experiments to catchments experiencing a range of climatic conditions in Australia, particularly in areas where significant precipitation is the result of convective processes.

Conclusions
This study evaluates the performance of precipitation forecasts from the latest generation of Australian Numerical Weather Prediction (NWP) models over the Ovens catchment in southeast Australia. The precipitation forecasts from four NWP models (viz. ACCESS-G, ACCESS-R, ACCESS-A and ACCESS-VT) are compared to observed precipitation at measurement stations and to interpolated catchment average precipitation over a one-year period. A number of continuous and categorical evaluation scores have been used to assess the skill of the ACCESS models at different lead times and temporal resolutions. The effects of the diurnal cycle of the precipitation observations and of sampling uncertainty on the model performance are also investigated.
The results show that the skill of the NWP precipitation forecasts varies considerably across the stations and is not strongly related to spatial or precipitation patterns, although it shows some structure with respect to the latitude and altitude of the stations. The high resolution models ACCESS-VT and ACCESS-A overestimate 24 h precipitation accumulations in dry, low elevation areas by up to 60 % and underestimate 24 h precipitation accumulations in wet, high elevation areas by up to 30 %. The low resolution model ACCESS-G underestimates 24 h precipitation accumulations by up to 70 % over all stations and, in general, the bias increases with latitude (and altitude). The correlation of the high resolution NWP models (ACCESS-VT and ACCESS-A) and the regional model (ACCESS-R) is as good as that of the low resolution model (ACCESS-G). Overall, high resolution NWP models capture the variability of the precipitation across the stations and perform better at predicting the aggregated precipitation amount than the precise location or timing of the precipitation. There is a tendency for small amounts of precipitation to be forecast too frequently by the NWP models.
The skill of the NWP model forecasts varies significantly with forecast lead time. In general, forecast skill decreases with lead time; however, there are many instances where the skill at shorter lead times is lower than at longer lead times. This can be attributed mainly to sampling and diurnal variation. Observed precipitation displays a diurnal cycle, with maximum mean precipitation occurring between 17:00 and 21:00 LT, while the NWP precipitation forecasts fail to capture the cycle. Consequently, some evaluation scores such as the bias and frequency bias show evidence of a diurnal cycle that is consistent with that of the observations. Uncertainty analysis reveals that the evaluation scores have significant sampling variation. The NWP forecasts appear to have little skill when evaluated at a short temporal resolution (e.g. hourly or 3 hourly).
The skill of the forecasts increases with increasing precipitation accumulation period (at least up to 24 h) because timing errors in individual periods tend to compensate for each other.
The skill of the ACCESS model forecasts is higher at the catchment scale than at measurement stations. Spatial averaging of precipitation over a catchment reduces displacement errors and provides a better indicator of the quality of the forecast of precipitation volume. Systematic biases in the global ACCESS model are also evident in catchment average precipitation forecasts. The model struggles to produce high enough intensity forecasts. The resolution of the global model is too coarse to resolve the small catchment scale.
Future work is planned to assess the benefits of using the NWP precipitation forecasts for short-term streamflow forecasting. Our findings here suggest that it is necessary to remove the systematic biases in precipitation forecasts, particularly those from low resolution models, before the forecasts can be used for streamflow forecasting. Post-processing techniques to remove biases and reliably quantify precipitation forecast uncertainty are currently being developed and tested by the authors.

streamflow with lead times up to 10 days are important for water resources management and mitigating impacts of floods. Streamflow forecasts are produced by initialising the state variables of a hydrological model to their condition at the forecast

has been the operational NWP system employed by the Australian Bureau of Meteorology (BoM) since August 2010. The ACCESS NWP model system is based on the UK Met Office's Unified Model/Variational Assimilation (UM/VAR) system with multiple resolutions and spatial domains extending from a coarse resolution global model down to the high resolution city-based models. This study uses the initial rollout of the ACCESS system APS0 (Australian Parallel Suite version 0). The APS0 version

higher resolution models. The ACCESS-R model also has a tendency to overforecast in lowland areas and underforecast in highland areas. The coarse resolution model ACCESS-G places the highest precipitation over the north west of the catchment. Like the ACCESS-R, the ACCESS-G model has a precipitation minimum in the headwaters of the catchment. Note that the ACCESS-G model has only 4 grid cells to cover the entire catchment. Unlike other models, the ACCESS-G underestimates precipitation over the entire catchment. The ACCESS-VT and ACCESS-A model forecasts appear to capture the gradient of precipitation across the catchment although they appear to have less variability than the observations. The ACCESS-R and ACCESS-G model resolutions do not meaningfully represent the fine scale patterns of variability across the catchment. Clearly, downscaling and bias adjustment are operationally recommended for the ACCESS-R and ACCESS-G models.

has a minimum RMSE score of about 0.50 for Mount Buffalo station (17) and a maximum value of 1.16 for Angleside station (12). ACCESS-A model has the highest RMSE score at Bald Hill station (31). ACCESS-G model has lower RMSE scores in the lower elevation areas compared to the higher elevation areas. The ACCESS-R model has average values of the RMSE score compared to other models. In general, the RMSE score does not exhibit any strong spatial pattern. Figure 5b depicts the bias of the ACCESS model forecasts as a percentage of the observed values. The ACCESS-VT and ACCESS-A models overestimate dry (low elevation) areas by up to 60 % and underestimate wet (high elevation) areas by up to 30 %.

) and Mount Buffalo (17) stations share the same value of precipitation forecasts as they lie in the same grid cell of the ACCESS-G model, but have quite different observed precipitation (mean daily values of 2.77 mm vs 8.81 mm). The variability of mean daily precipitation across the stations (standard deviation of 1.41) is much higher than that of the ACCESS-G model (standard deviation of 0.32). The very low value of correlation

Figure 11b depicts the bias score of catchment average precipitation forecasts from the ACCESS-G model. Systematic biases in the forecasts are evident, where it struggles to produce high enough intensity forecasts. The bias of the ACCESS-G model forecasts is around −48 % at the shortest lead time, then fluctuates around −50 %, and finally reaches to about −67 % at the longest lead time. The ACCESS-G model forecasts tend to be lower than the interpolated catchment average precipitation. Like station precipitation, a diurnal cycle is present in the bias score of catchment average

(i) they (the stations) are in same model grid cell, but have different precipitation observation (observed variability), or (ii) they are in different model grid cells, thus have different forecasts (forecast variability) or (iii) they are in different model grid cells, have similar forecasts, but different precipitation observation

displacement errors and provides a better indicator of the quality of the forecast of precipitation volume. Systematic biases in the global ACCESS model are also evident in catchment average precipitation forecasts. The model struggles to produce high enough intensity forecasts. The resolution of the global model is too coarse to resolve the small catchment scale.

Fig. 4. A comparison of daily average (1-24 h accumulated) observed precipitation at stations and forecasted precipitation by the ACCESS models in the Ovens catchment for 1 April 2010 to 8 February 2011: (a) Observed station precipitation, (b) ACCESS-VT, (c) ACCESS-A, (d) ACCESS-R, and (e) ACCESS-G.

Fig. 5. Evaluation scores of the ACCESS models for daily precipitation forecast at stations: (a) RMSE, (b) bias, and (c) Spearman rank correlation. Stations are ordered according to latitude then longitude.

Fig. 8. A comparison of the evaluation scores of the ACCESS-G model for different temporal precipitation accumulation periods at Carboor Upper station: (a) RMSE, (b) bias, and (c) Spearman correlation coefficient.

an area of 5552 km². The upper catchment is steep and hilly, and covered by native forest and tree plantations. The lower catchment is relatively flat with a wide floodplain
accumulation period is for 24 h from 09:00 LT and the precipitation is averaged over the period from 1 April 2010 to 8 February 2011. Here, dark blue indicates higher precipitation and white indicates relatively drier conditions. The ACCESS-VT model has a precipitation maximum adjacent to the eastern extremity of the catchment just to the west of Mount Bogong (elevation 1988 m, longitude 147.2°, latitude −36.8°, between stations 11 and 26). The average daily precipitation forecast at this location is about 10.5 mm. Regrettably, the area of the highest forecast precipitation is without a measurement station. The closest measurement station (26) is about 10 km south east of the location of the highest precipitation forecast. This station has observed precipitation of 5.89 mm, while the corresponding grid cell forecast by the ACCESS-VT model is 7.95 mm. The measurement stations with the highest precipitation observations are 17 (8.81 mm), 33 (7.03 mm) and 30 (6.88 mm). The forecast precipitation for the corresponding model grid cells for these stations is 7.26 mm, 4.97 mm and 5.98 mm, respectively. The ACCESS-VT model has a tendency to overforecast in lowland areas (north of the catchment) and underforecast in highland areas (south of the catchment).

Table 1. Precipitation forecasts available from NWP models for the study.

Table 3. Contingency table of binary events for categorical verification scores.