A rule based quality control method for hourly rainfall data and a 1km resolution gridded hourly rainfall dataset for Great Britain: CEH-GEAR1hr.

High-resolution gridded precipitation products are rare globally, particularly below a daily time-step, yet many hydrological applications require, or can be improved by, a higher temporal resolution of rainfall data. Here, we present a new 1km resolution gridded hourly rainfall dataset for Great Britain (Gridded estimates of hourly areal rainfall for Great Britain (1990 – 2014) [CEH-GEAR1hr]) using data from over 1900 quality controlled rainfall gauges, which improves upon the current UK national gridded precipitation datasets at daily time-step. We extend and automate a quality control (QC) procedure to permit the use of hourly data for 1990 – 2014 and independently validate the QC using daily rainfall data and recorded historic events. Our two-tiered validation approach, at daily and hourly timescales, indicates that spurious extreme values are excluded from the resultant dataset, while legitimate values are preserved. We use a nearest neighbour interpolation scheme to derive gridded hourly rainfall values at 1km resolution, to temporally disaggregate the CEH-GEAR daily gridded dataset and produce an hourly dataset with consistent daily totals. This provides a unique resource for hydrological applications in Great Britain. The CEH-GEAR1hr dataset, associated metadata and QC information, will be freely available from the Environmental Information Data Centre (EIDC) and hosted alongside the daily and monthly CEH-GEAR product.


Introduction
Recently, increased attention has been given to sub-daily precipitation observations due to the contribution of intense rainfall events to flash flooding in urban areas and fast-responding catchments. Indeed, our ability to address and plan for flash floods has been partly limited by the paucity of available high-quality rain gauge data (Westra et al., 2014). Several studies have demonstrated the sensitivity and improved performance of hydrological model simulations when driven by precipitation data at a sub-daily time step (e.g. Finnerty et al., 1997; Bastola and Misra, 2013), particularly for small catchments with rapid response times. Additionally, the lack of temporal resolution offered by daily data for direct application in flood forecasting and the need for assessment of the impacts of short-duration intense rainfall events on hydrological systems have created a requirement for the improved availability of sub-daily precipitation data. This was further identified as a need in the World Climate Research Programme Grand Challenge on Extremes (Alexander et al., 2016) and the INTENSE project has taken up the mantle of collecting and quality-controlling a global sub-daily precipitation dataset (www.research.ncl.ac.uk/intense). Such datasets are also invaluable for the validation of the new generation of very high-resolution convection-permitting climate models (CPMs; see Prein et al. (2015) for a review) which offer improved representation of sub-daily extreme rainfall (e.g. Kendon et al., 2012; Kendon et al., Bell et al., 2007; Yang et al., 2014), the assessment of historical climate and its variability (e.g. Blenkinsop et al., 2008; Simpson and Jones, 2014; Becker et al., 2013; Yu et al., 2016) and the assessment of reanalysis and downscaled climate model products (e.g. Isotta et al., 2015).
Existing datasets of ground-based observations are typically on daily timescales, but gridded hourly products offer the potential for enhanced applications in these areas as well as for the verification of quantitative precipitation forecasts, satellite products and the validation of CPMs. Radar data offers the required temporal resolution but suffers from errors in the estimation of precipitation magnitude (Collier, 1989; Villarini and Krajewski, 2010) and may lead to reduced performance of hydrological models compared with gauge-derived data (Cole and Moore, 2009; Parkes et al., 2013).
A number of gridded precipitation products derived from rain gauges are available for the UK and are summarised in Table S1 in the Supplementary Information (comprising UK-only and Europe-wide datasets). Reanalysis data, which are created by a data assimilation scheme and models which incorporate observations at 6-12 h timesteps, may also be used to characterise the observed long-term variability of precipitation (e.g. NCEP/NCAR: Kalnay et al., 1996; 20CR: Compo et al., 2011; ERA-Interim: Dee et al., 2011). However, these are typically at coarser spatial resolutions and are not appropriate for most hydrological modelling studies. For example, Rhodes et al. (2015) noted that although reanalysis products represent many of the features of large-scale precipitation and daily totals over England and Wales, individual extreme events are less well represented. Regional reanalyses may address some of the problems associated with coarse scale reanalysis datasets though improved understanding of uncertainties is needed (Borsche et al., 2015).
Gridded datasets at hourly resolution have been constructed for some regions using a range of methodological approaches, although these are typically only for short time-periods, and often constructed for the calibration and assessment of hydrological models in catchments. For example, Wüest et al. (2010) used a dense daily rain-gauge network for Switzerland for 1992-2003 and disaggregated this to the hourly timescale using radar data to preserve daily totals. A similar approach was used by Paulat et al. (2008) to create a gridded hourly dataset for Germany for the period 2001-2004. Shen et al. (2014) merged over 30,000 hourly gauge observations with satellite data to produce a gridded hourly dataset for China for the 2008-2010 warm seasons. An alternative approach was applied by Luo et al. (2013) who constructed an hourly precipitation grid over the Yangtze-Huai Rivers Basin in China for the 2007 "mei-yu" season using a direct interpolation method with automated weather station (AWS) data. However, some national-scale sub-daily datasets have been produced. Vormoor and Skaugen (2013) developed a 1 km 3 h gridded precipitation dataset for Norway using a model simulated hindcast series to disaggregate an existing gridded daily dataset for the period 1957-2010. Hourly observations for approximately 2,500 stations in the US have also been interpolated onto a relatively coarse (2° latitude by 2.5° longitude) grid and a gridded product created through merging of daily gauges with radar (Cosgrove et al., 2003) up to a resolution of 1/8°.
For the UK, generation of gridded sub-daily precipitation data on a catchment-scale over limited periods has been performed on an ad hoc basis for the assessment of hydrological models and their input data. Typically, this has been achieved by disaggregating a relatively dense daily rain gauge network using radar (e.g. Parkes et al. (2013) for the 4062 km² of the Upper Severn River; Cole and Moore (2008) for 136 km² of the River Darwen and 212 km² of the River Kent catchments). The availability of quality-controlled individual hourly rain gauge data for the UK (Blenkinsop et al., 2017, and extended here) offers the potential for a more extensive gridded hourly dataset for UK coverage.
Quality control of gauge rainfall data is essential to ensure a high quality product. There are many shortcomings of gauge measurement of rainfall including mechanical errors, recording errors, evaporation from partly-filled buckets, wind-induced under-catch, and snow-effects (McMillan et al., 2012). QC procedures can identify some of these errors, for example, frequent tips (Upton and Rahimi, 2003), erroneous extreme values and erroneous dry periods (Abbott, 1986). However, QC procedures are unable to identify errors due to under-catch and evaporation. Therefore the measurements may still not reflect the true amount of rain that fell even after QC checks are completed. Quality control is typically a manual operation and in some instances quasi-automated, but still relies heavily on either a final manual inspection or on previously manually identified errors for model training. Prior to this work, analysis of the UK rain gauge data has focussed only on a subset (< 20%) of available rain gauges due to the need for near-complete records for climatological analysis and the labour-intensive nature of quality controlling such data (Blenkinsop et al., 2017). This paper builds on Blenkinsop et al. (2017) by: i) developing additional quality control procedures whose implementation may be automated and applied to all ∼1900 gauge records, and ii) using these data to produce a gridded (1 km) hourly dataset for the UK for 1990-2014. Section 2 describes the data sources. Section 3 describes the extended and automated quality control procedure for the hourly rain gauge data, which includes a rule base for the implementation of single-site and nearest neighbour gauge checks. Section 4 details the validation of the automated QC process. Section 5 describes the resulting gauge dataset. Section 6 describes the methodological basis for the disaggregation of the daily gridded dataset. Section 7 assesses the reliability of the hourly gridded dataset and finally, in Section 8, we discuss the implications and potential applications of this dataset and potential future avenues which could improve the product and be used to investigate the associated uncertainties.

CEH-GEAR 1 km daily dataset
To generate gridded hourly precipitation values we adopt a similar approach to that used in other studies (as discussed in Section 1), by disaggregating an existing quality controlled, validated gridded daily precipitation dataset. Here we use the CEH Gridded Estimates of Areal Rainfall (CEH-GEAR) dataset (Tanguy et al., 2016), which is an open source dataset that provides 1 km gridded estimates of daily and monthly rainfall for Great Britain and Northern Ireland from 1890 to 2014. The rainfall estimates in this dataset are derived from UK Met Office rain gauge observations, which were gridded using a natural neighbour interpolation method (a smooth, weighted version of nearest neighbour interpolation). We extracted the daily gridded estimates for 1990-2014 and temporally disaggregated these using an expanded version of the quality-controlled hourly rain gauge dataset of Blenkinsop et al. (2017). Thus, the new, hourly gridded dataset, CEH-GEAR1hr, preserves the daily totals of, and is consistent with, the widely used CEH-GEAR dataset. This allows the hourly gridded dataset to be made open source (currently the hourly rain gauge data is not freely available) and also allows direct comparisons to be made between the different temporal resolutions.
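The idea of disaggregating a daily grid value while preserving its total can be illustrated with a minimal Python sketch. It is not the procedure used to build CEH-GEAR1hr: the function name, the list-of-gauges input and both fallbacks (skipping a gauge that is dry or incomplete, and spreading the total evenly when no gauge is usable) are assumptions made purely for illustration.

```python
def disaggregate_daily_total(daily_total_mm, gauges_by_distance):
    """Split one grid cell's daily total into 24 hourly values.

    gauges_by_distance is a hypothetical input: a list of 24-value hourly
    series for the same day, ordered from nearest to farthest gauge, with
    None marking missing hours.
    """
    for hourly in gauges_by_distance:
        if len(hourly) != 24 or any(v is None for v in hourly):
            continue  # incomplete day: try the next nearest gauge
        gauge_sum = sum(hourly)
        if gauge_sum > 0:
            # Scale the gauge's hourly fractions so they sum to the grid total.
            return [daily_total_mm * v / gauge_sum for v in hourly]
    # Placeholder assumption: no usable gauge profile, so spread evenly.
    return [daily_total_mm / 24.0] * 24


# The nearest gauge has missing data, so the second gauge's profile is used.
hourly = disaggregate_daily_total(10.0, [[None] * 24, [0.0] * 22 + [1.0, 3.0]])
```

Because each hourly value is a fraction of the daily grid value, the 24 values sum to the original daily total by construction.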

Hourly rain gauge data
To disaggregate the daily gridded dataset, a large dataset of hourly accumulations from 1903 UK rain gauges (see Fig. 1 for coverage) was used including a mixture of tipping bucket (TBR), 15 min and hourly data. These data were derived from the UK Met Office Integrated Data Archive System (MIDAS; downloadable from the British Atmospheric Data Centre, Met Office, 2012), the England Environment Agency (EA), Natural Resources Wales (NRW) and the Scottish Environment Protection Agency (SEPA). This dataset was originally created and used in Blenkinsop et al. (2017). For use in this paper, the dataset was extended to the end of 2014 and an additional 216 gauges were included (all of less than 10 years duration) from MIDAS. Blenkinsop et al. (2017) identified significant errors within these data and noted that the original dataset required additional quality control (QC) due to problems which included the recording of accumulated totals (principally daily), unfeasibly large hourly and daily values, unrecorded non-operation of gauges and, in the case of TBR data, unrealistic, high frequency tipping. In particular, internal QC procedures employed by the EA, including the use of check gauges, were noted to be affected by the institutional administrative structure and also varied in time. A series of single-site gauge tests were applied to identify the most egregious errors; however, a significant amount of time-consuming manual inspection of gauge records was required to produce the Blenkinsop et al. (2017) dataset, which comprises a subset of 376 near-complete quality controlled gauges covering the period 1992-2011. These limited QC procedures identified two clear requirements to enable the maximum value to be derived from the entire dataset. The first is the incorporation of additional QC methods to include those which use neighbouring gauge data as an additional check. These are routinely used for the QC of daily data (e.g. Keller et al., 2015; Sciuto et al., 2009) but their application on hourly timescales is problematic given the localised nature of convective storms. The second is the development of an automated QC 'rule base' that integrates the results of the various QC tests, to negate the requirement for large-scale manual data inspection. This requires the identification of appropriate metrics by which to assess the performance of the rule base and the overall quality of the dataset, and these are described in Section 4. This also provides the potential to further develop these formalised rules to either improve QC procedures as more understanding of the data is realised, to investigate the uncertainty associated with the QC process, or to have a different set of rules for specific applications.

Automated quality control tests
The quality control procedure is a three-step process: 1) The gauge data is compared to the gridded CEH-GEAR daily dataset to identify suspect gauges which may subsequently be excluded from the dataset. 2) A series of quality control tests are applied to identify suspect values at all gauges which are marked with a quality control flag (but not excluded at this stage). 3) Combinations of quality control flags for a given hourly accumulation (referred to here as 'rules') are used to determine which flagged data are treated as erroneous values and thus excluded from the gauge records.

Step 1. Comparison of gauge data to gridded daily dataset
We first compared the (pre-QC) hourly gauge data (accumulated to 24 h) to the CEH-GEAR gridded daily dataset to estimate the initial quality of the rain gauge data and provide a baseline against which the QC process may be assessed. For each gauge we selected the CEH-GEAR grid cell over its location and used the Spearman's rank correlation coefficient (r_s) and percentage correct statistics (P11, P00) (Wilks, 2006; Yoo and Ha, 2007) for comparison of the two time series. The latter is here calculated as the proportion of days in which rainfall is greater than 0 mm in both records (P11) or is dry in both records (P00) and is therefore a measure of concordance in rainfall occurrence. Whilst CEH-GEAR may not provide an accurate representation of rainfall at a point location, as it is an interpolated product and may also contain its own errors, the comparison is helpful to highlight potential errors in the gauge records which can then be examined further (recorded as suspect by QC9 in Table 2). Fig. 2 demonstrates that the majority of gauges (92.9%) match well with the corresponding grid cell (where r_s > 0.8 and P00 + P11 > 0.8). However, the long tail of gauges outside these bounds highlights the need for rigorous quality control of the gauge data.
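The comparison statistics used in this step can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, tied values receive average ranks, and a "dry" day is assumed to mean exactly 0 mm.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def occurrence_agreement(gauge_daily, grid_daily):
    """Return (P11, P00): proportion of days wet in both / dry in both."""
    n = len(gauge_daily)
    pairs = list(zip(gauge_daily, grid_daily))
    p11 = sum(1 for g, c in pairs if g > 0 and c > 0) / n
    p00 = sum(1 for g, c in pairs if g == 0 and c == 0) / n
    return p11, p00


def matches_grid(gauge_daily, grid_daily, min_rs=0.8, min_agreement=0.8):
    """Apply the r_s > 0.8 and P00 + P11 > 0.8 criteria from the text."""
    p11, p00 = occurrence_agreement(gauge_daily, grid_daily)
    return spearman(gauge_daily, grid_daily) > min_rs and p11 + p00 > min_agreement
```

For example, `matches_grid([0, 0, 5, 10, 0, 2], [0, 0.2, 4, 12, 0, 3])` evaluates both criteria for a six-day toy record; real gauge records would of course be much longer and would need missing days removed first.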

Step 2. Identification and flagging of suspect gauge data
Blenkinsop et al. (2017) applied a series of single-gauge QC tests independently to each rain gauge. We supplement this by comparing the hourly data with that of neighbouring gauges. In total therefore, we apply 15 QC tests: 11 single-gauge QC tests and 4 neighbourhood-gauge QC tests (identifying dry spells and high values, each applied seasonally), with each hourly value allocated a flag if a potentially suspect hourly/daily total is identified.
The single-site gauge QC tests are based on the understanding of rainfall processes and known measurement practice (see Blenkinsop et al., 2017 for further details). For example, we include checks against known rainfall records and checks for common instrumentation errors such as accumulations (Table 1, flags 1 to 11).
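As an illustration of a single-site check, the daily accumulation test summarised in Table 1 (a value at 0900 or 1200 preceded by 23 dry hours and exceeding twice the mean wet day amount) might be sketched as follows. The function name and arguments are hypothetical, a single mean wet day amount is used instead of a month-specific one, and missing values are not handled.

```python
def accumulation_flags(hourly_mm, start_hour, mean_wet_day_mm):
    """Flag suspect daily accumulations in an hourly series.

    hourly_mm       : hourly rainfall values with no gaps (assumption)
    start_hour      : hour of day (0-23) of the first value (assumption)
    mean_wet_day_mm : one representative mean wet day amount; the test as
                      described uses the value for the corresponding month
    Returns a list with 1 for a suspect 0900 accumulation, 2 for a suspect
    1200 accumulation and 0 otherwise.
    """
    flags = [0] * len(hourly_mm)
    for i in range(23, len(hourly_mm)):
        hour = (start_hour + i) % 24
        if hour not in (9, 12):
            continue
        preceded_by_dry = all(v == 0 for v in hourly_mm[i - 23:i])
        if preceded_by_dry and hourly_mm[i] > 2 * mean_wet_day_mm:
            flags[i] = 1 if hour == 9 else 2
    return flags


# Two days of data starting at midnight, with 20 mm reported at 0900 on day 2.
series = [0.0] * 48
series[33] = 20.0
flags = accumulation_flags(series, start_hour=0, mean_wet_day_mm=5.0)
```

A value is only flagged here, not removed; whether it is excluded depends on the rule base applied in step 3.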
Some types of errors, such as reporting or instrument errors, are unlikely to be duplicated across gauge networks and are more readily detected by comparison with neighbouring gauges. We use neighbourhood analysis to assess whether measurements at a gauge of interest are statistical outliers when compared with those of their similar neighbours (Table 1, QC flags 12 to 15). Such approaches are most useful when the correlation decay distance is high, e.g. for temperature, but have also been used previously in the QC of rainfall datasets (see e.g. Eischeid et al. (1995), Upton and Rahimi (2003), Sciuto et al. (2009) and Keller et al. (2015)). Given the high variability in hourly rainfall totals, we apply these techniques to 24 h (daily) aggregations. See Supplementary Information for a detailed description of the neighbourhood analysis methodology.

Step 3. Application of a rule base to exclude suspect values
One of the key objectives of this work was to develop a quasi-automated procedure to interpret data flagged as potentially suspect in order to minimise the need for manual intervention in the QC process. This would mean that a) the QC procedure could be modified relatively easily and efficiently, b) the number of rules applied could be changed for different situations or analyses if less strict criteria were required, and c) the automated QC process could be applied to other comparable rain gauge datasets. The rule base presented in Table 2 uses the QC flags determined in the previous step (Table 1) in an intelligent manner based on knowledge of regional rainfall processes and characteristics and common errors in the rain gauge data (Blenkinsop et al., 2017). It thus comprises a set of 20 rules (Rn) that combine the QC flags either individually, in combination with other flags or in relation to other data characteristics such as dry sequences. Its aim is to fulfil the two criteria of excluding the most egregious errors in the data but also simultaneously preserving 'real' extreme values. Whilst the QC flags applied in step 2 provide valuable information on suspect data in the dataset, individual flags may not identify erroneous data effectively. For example, the threshold-based tests (QC1-3) are derived from the UK Met Office gauge network but this does not include the additional EA gauge data that significantly increases the coverage of the UK and may capture previously unrecorded events. A judgement is therefore required for such tests as to the appropriate thresholds at which a value may confidently be judged to be erroneous. The use of different thresholds allows the testing of different levels of severity in the implementation of the QC tests. Here, we judge that marginal threshold exceedance is insufficient on its own to identify erroneous data and so only automatically exclude such data if the UK record is exceeded by at least 20% (Table 2, R1). For smaller exceedances, data is only excluded if further evidence of problems with the data exists (Table 2, R2-R7) as is the case for those values with non-threshold flags (R8-R13).
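The logic of the threshold rules can be sketched as below. This is a simplified illustration of the two cases described in the text, not the full 20-rule base: the function name and arguments are hypothetical, `record_mm` stands for whichever UK record applies to the accumulation period, and non-threshold rules (R8-R13) are deliberately left out.

```python
def exclude_by_threshold_rules(value_mm, record_mm, other_flags_raised, margin=0.20):
    """Decide whether a value should be excluded under threshold-style rules.

    value_mm           : rainfall value being tested
    record_mm          : relevant UK record (supplied by the caller; no
                         specific record value is assumed here)
    other_flags_raised : True if further QC evidence of a problem exists
    margin             : exceedance needed for automatic exclusion (20%)
    """
    if value_mm >= record_mm * (1.0 + margin):
        return True  # R1-style: record exceeded by at least 20%
    if value_mm > record_mm and other_flags_raised:
        return True  # R2-R7-style: smaller exceedance plus further evidence
    return False


# With a placeholder record of 100 mm:
decisions = [
    exclude_by_threshold_rules(125.0, 100.0, False),  # large exceedance
    exclude_by_threshold_rules(110.0, 100.0, False),  # marginal, no evidence
    exclude_by_threshold_rules(110.0, 100.0, True),   # marginal, with evidence
]
```

Keeping the margin as a parameter reflects the point made above that different thresholds allow different levels of severity to be tested.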
The application of the neighbouring gauge checks also needed careful consideration. Initially these were implemented throughout the year but it was noted that this resulted in the exclusion of a well-documented event at Boscastle, south west England in 2004 (Doe, 2004). As noted previously, the most intense hourly rainfall typically occurs from late spring to early autumn (Blenkinsop et al., 2017) and can be highly localised. This causes a rapid decrease in correlation between gauges with distance in summer months resulting in: i) daily accumulations differing significantly from the surrounding neighbours, ii) dry spells that are not recorded in surrounding neighbours. To allow for this highly localised nature of extreme rainfall events in summer months, the implemented rule base only applies high threshold value neighbourhood checks in winter (R20).
Fig. 2. Comparison of hourly gauge data (accumulated to 24 h) with the corresponding CEH-GEAR daily time series for 1903 gauges. P00 + P11 is the proportion of days in which it rains in both records or it is dry in both records, r_s is the Spearman's rank correlation coefficient.
The final hourly precipitation gauge dataset includes the provision of all QC flags as gauge metadata. This ensures that all QC decisions are both transparent and traceable, and that users are able to test alternative rules and apply a custom rule base if required as appropriate for their application. For example, if gauge data were to be assimilated with radar data, the occurrence of rainfall may be useful even if the magnitude is erroneously recorded.

Validation of automated QC process
The QC rule base aims to exclude erroneous data whilst retaining correct values. Definitive validation of the rule base is impossible to achieve as other sources of national rainfall data may also contain errors. However, some attempt at validation is essential to provide confidence in the resultant dataset. We have therefore validated the rule base in two ways, firstly by comparing the resultant quality-controlled gauge data to a different gridded daily dataset, and secondly by comparing it to known historic storm and flood events recorded in the literature.

Validation of the exclusion of large rainfall values
Daily totals were calculated from the quality-controlled hourly gauges and compared to the corresponding UKCP09 5 km gridded daily dataset (Perry et al., 2009). We used the UKCP09 dataset for validation as the CEH-GEAR dataset was used in the QC process to flag suspicious data (QC9). Although UKCP09 and CEH-GEAR use different interpolation methods, they are still highly correlated. Indeed, when the 24 h accumulation for each CEH-GEAR 1 km time series is compared with its corresponding 5 km grid square, the Spearman's rank and Pearson correlation coefficients range from 0.9 to 1. As the main applications of the gridded hourly data likely relate to the occurrence and intensity of extreme events, this dataset was used to validate the rule base for days which contained very high wet values (wet hour Q99 of the original hourly records) as well as for dry spells longer than 20 days.
Fig. 3 shows the different 24 h Q99 event types highlighted by the validation process. Event type A is where a 24 h Q99 value is found in the gauge data and a wet event is found in the gridded data. As the two values coincide, it is likely that the event did occur. If the rule base excludes this value, we consider it 'incorrect'. Event type B occurs when there is high rainfall in the gauge data but not in the gridded dataset. For this type of event, if the rule base excludes this high value, we consider it 'correct' under the assumption that the gridded dataset is a reliable benchmark. Event type C occurs when there is a relatively high rainfall value observed in the gridded data but not in the gauge data. This type of event would not be excluded by the rule base as the value in the gauge data is low. In order to make this 'high-low' assessment of the difference between the two datasets we consider similar events as those where the percentage difference is less than 65% (event type A in Fig. 3, plotting around the 1:1 line on Fig. 4). 65% was chosen as the majority of differences are smaller than this. Conversely, if the difference is large, i.e. > 65%, we would expect that those values should frequently be excluded by the rule base.
Table 1. Summary of automated quality control flags applied to all data and described in the main text (QC process step 2).
QC4: Suspect daily accumulations at 0900 or 1200, flagged where a recorded rainfall amount at these times is preceded by 23 h with no rain. A threshold of 2× the mean wet day amount for the corresponding month is applied to increase the chance of identifying accumulated values at the expense of genuine, moderate events. Flag values: accumulation at 0900 = 1; accumulation at 1200 = 2.
QC5 (Non-threshold): Suspect consecutive daily accumulations at 0900 or 1200, flagged where recorded rainfall amounts at these times are preceded by 23 h with no rain on consecutive days, with no threshold to the wet hour amount applied. Flag values: accumulation at 0900 = 1; accumulation at 1200 = 2.
QC6 (Non-threshold): Suspect monthly accumulations. Identified where only one hourly value is reported over a period of a month and that value exceeds the mean wet hour amount for the corresponding month (a lower threshold than in QC4 is used here as a dry month is much more unlikely than a dry day in GB).
In Fig. 4 we compare daily total rainfall (0900 to 0900) encompassing each hourly Q99 event from the hourly gauge records, to the corresponding day and location in the UKCP09 daily dataset. This shows that as would be hoped, non-excluded points generally cluster around the 1:1 line where the two datasets are in good agreement, whilst a large number of excluded points are characterised by lower correspondence between the two datasets, indicating that generally the rule base is working as intended. In total, 24.3% of type B events were excluded by the rule base in contrast with only 2.0% of type A events.
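The event-type comparison can be made concrete with a short sketch. The text does not define how the percentage difference is computed, so the version below (absolute difference relative to the larger of the two daily totals) is an assumption, as are the function names; type A denotes events where gauge and grid broadly agree, type B a high gauge value with a low grid value, and type C the reverse.

```python
def classify_event(gauge_mm, grid_mm, max_pct_diff=65.0):
    """Classify a daily total pair as event type 'A', 'B' or 'C'."""
    larger = max(gauge_mm, grid_mm)
    if larger == 0:
        return "A"  # both dry: trivially in agreement
    # Assumed definition: difference relative to the larger value.
    pct_diff = 100.0 * abs(gauge_mm - grid_mm) / larger
    if pct_diff < max_pct_diff:
        return "A"
    return "B" if gauge_mm > grid_mm else "C"


def exclusion_rate(events, event_type):
    """Percentage of events of one type that the rule base excluded.

    events is a list of (gauge_mm, grid_mm, excluded) tuples.
    """
    flags = [exc for g, c, exc in events if classify_event(g, c) == event_type]
    return 100.0 * sum(flags) / len(flags) if flags else float("nan")


# Hypothetical events: (gauge daily total, grid daily total, excluded by rules)
events = [(50.0, 45.0, False), (60.0, 6.0, True), (80.0, 10.0, False), (5.0, 40.0, False)]
rate_b = exclusion_rate(events, "B")
```

Rates computed this way for type A and type B events correspond to the kind of summary percentages reported above, although the actual figures depend on the authors' own definition of the percentage difference.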
The largest excluded values are typically eliminated by rules containing threshold checks (as categorised in Tables 1 and 2). A line of events on the ∼6:1 line in Fig. 4 are excluded by non-threshold checks which identify unexplained scaling in the magnitude of the hourly record, typically after a period of no data (see Fig. 5) and are likely related to some undocumented gauge malfunction, although this feature did require some manual checking to identify the nature of the problem (QC9). It may be the case that standard statistical tests for break points (e.g. Buishand, 1982; Pettitt, 1979) could instead be used to identify such errors but these also have a number of limitations of their own (Serinaldi et al., 2018). It is also noticeable that a large number of excluded events do lie around the 1:1 line, i.e. potentially 'incorrectly excluded' as their daily totals approximately agree. However, further examination indicates that these are typically excluded by rules that apply the daily accumulation flag (QC4), which means that although the daily totals are in broad agreement, the storm shape is wrong in the hourly record, typically as a result of recording the total as a 0900 accumulation, resulting in their exclusion.
Validation by comparison to the daily gridded dataset is a useful method but it is not without limitations, particularly with regard to the lack of commensurability in the spatial representation of the two data sources. No metadata on the locations of the gauges is provided with the UKCP09 5 km gridded daily rainfall dataset, so we may be comparing the hourly data to an interpolated grid square value. We therefore have to be particularly careful when validating in this way as extreme sub-daily rainfall events can be of short duration and limited in their spatial extent and may not be captured in an interpolated grid square, or may be smoothed out by interpolation. This may explain why many type B events were not excluded.
Table 2. Automated rule base definition (QC process step 3) constructed on individual and combined QC flags and based on knowledge of known rainfall processes and errors. If the criteria of any of the rules are fulfilled, the suspect value(s) are treated as erroneous and removed from the record and replaced with a missing data value (−999). QC flags which are applied in step 2 are defined in Table 1.
To further check that the rule base was not excluding type A events, we examined the correspondence of our data with known, high-intensity historic events. We took 16 historic extreme rainfall events recorded in 'Weather' journal articles and a further 27 from a chronology of severe UK weather events between 1901 and 2008 (Eden, 2008). The location, date, and duration of each event were identified from the literature (see Table S2) and compared to the highest recorded hourly value 24 h either side of the event at the nearest recording gauge. For 12 of these events, no rainfall was recorded in the nearest gauge and for 3, a period of missing data was recorded in the nearest gauge. Of the remaining 28, no events were excluded by the rule base which provides some confidence that it is not excluding real extreme events. This method of validation is somewhat limited as the storm centre and its extent are not reported and so the extreme events may not necessarily show in the nearest hourly gauge record. We also acknowledge that this is a relatively small sample of events compared to the number of gauges but it does provide some confidence that the QC process is not excluding genuine extreme events that were frequently associated with significant impacts.
Fig. 3. Schematic of hypothetical scenarios of different large value event types highlighted by the comparison of the quality controlled hourly data (aggregated to daily) to the gridded daily data from the UKCP09 5 km gridded daily dataset as part of the validation process. Event type A is where a Q99 value is found in the gauge data and a wet event is found in the gridded data. Event type B describes high rainfall in the gauge data and low rainfall in the gridded dataset. Event type C describes a relatively high rainfall value observed in the gridded data but not in the gauge data.
Fig. 4. Comparison of 24 h rainfall totals for hourly Q99 wet hour events (hourly gauge record) with corresponding totals from the daily gridded record. Events are marked as either excluded or not excluded by the rule base. The excluded events are distinguished as either a consequence of rules comprising threshold checks (including neighbourhood checks), non-threshold checks or both (see Table 1). The Q99 range of 24 h totals from the hourly record is 6.1-6157 mm.
Fig. 5. Example of inter-dataset comparison of accumulated daily rainfall totals for hourly data at Cowbridge, South Wales and corresponding daily totals from the CEH-GEAR and UKCP09 datasets to identify a non-homogeneous time series. These are generally identified by QC9 and removed by R18 (described in Tables 1 and 2).

Validation of the exclusion of dry spells
Long dry periods are another common error in the hourly gauge records. In particular, long sequences of zero values at the beginning and/or end of records. This may occur when the start or end time of a gauge has been incorrectly reported, its malfunctioning is undocumented, or data values have been incorrectly recorded as zero. Dry spells in the UK are typically defined by a 15-day threshold (Atkinson et al., 1985). For validation of the rule base, we however investigated all sequences of dry days over a relaxed threshold of 20 days or more within the gauge records. For each dry sequence the percentage of wet days and average daily rainfall for the relevant grid cell over the corresponding period in the UKCP09 dataset were calculated. Fig. 6 shows the two types of dry events examined in the validation process. For likely true dry periods (type D events), the % wet days and average daily rainfall in the UKCP09 gridded daily dataset are very low by definition. Type D events are therefore defined as periods that are dry in the gauge data and for which the UKCP09 gridded dataset has a wetday percentage value of ≤20% or an average daily rainfall of ≤1 mm. Type E events (likely erroneous), are defined as periods that are dry in the gauge data but for which the UKCP09 gridded dataset has a wet-day percentage value of > 20% or an average daily rainfall of > 1 mm. In total, 61.4% of type E (erroneous) events were excluded by the rule base, whereas only 3.3% of type D (true) events were excluded when evaluating by wet percentage day. Similarly, 69.3% of type E events were excluded, whilst only 8.0% of type D were when evaluating by average daily rainfall. The rule base therefore seems to be effective as it is excluding a large percentage of erroneous dry spells whilst only excluding a small percentage of real dry spells. 
Due to the interpolation of daily rainfall values, however, it is more likely to rain in the gridded record than at the gauge and so, even for true dry sequences, some rain might be expected in the UKCP09 gridded dataset. The percentage of wet days and average daily rainfall are much lower for non-excluded events than for those that are excluded (Fig. 7), demonstrating that the rule base is excluding mainly erroneous dry spells.
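The type D / type E labelling described above can be illustrated with a minimal sketch. It assumes that the two criteria (wet-day percentage and average daily rainfall) are evaluated separately, as the reported exclusion percentages suggest, and that any gridded daily value above 0 mm counts as a wet day; the function name and the wet-day cut-off are hypothetical and are not taken from the rule base code.

```python
from statistics import mean

WET_DAY_PCT_THRESHOLD = 20.0   # per cent, threshold stated in the text
MEAN_RAIN_THRESHOLD_MM = 1.0   # mm per day, threshold stated in the text


def classify_dry_spell(gridded_daily_mm, wet_day_mm=0.0):
    """Label one gauge dry spell (>= 20 days) using the gridded daily series.

    gridded_daily_mm: daily rainfall (mm) from the gridded reference dataset
    for the grid cell and period matching the dry spell in the gauge record.
    wet_day_mm: ASSUMED wet-day cut-off; the value used in the paper is not
    stated in this section.
    Returns (label_by_wet_day_percentage, label_by_mean_rainfall), where
    "D" is a likely true dry spell and "E" a likely erroneous one.
    """
    n_days = len(gridded_daily_mm)
    wet_days = sum(1 for r in gridded_daily_mm if r > wet_day_mm)
    wet_pct = 100.0 * wet_days / n_days
    mean_rain = mean(gridded_daily_mm)
    by_pct = "D" if wet_pct <= WET_DAY_PCT_THRESHOLD else "E"
    by_mean = "D" if mean_rain <= MEAN_RAIN_THRESHOLD_MM else "E"
    return by_pct, by_mean


# Illustrative 20-day periods (made-up values, not data from the study).
mostly_dry = [0.0] * 18 + [0.5, 0.5]
clearly_wet = [0.0, 3.0] * 10
one_big_day = [0.0] * 19 + [30.0]
```

The third example shows why the two criteria can disagree: a single wet day of 30 mm keeps the wet-day percentage low (5%) while raising the mean above 1 mm, which may explain why the exclusion rates are reported separately for each criterion.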

Resulting gauge dataset
In total, 3.4% of the hourly data was excluded by the QC process. Fig. 8 shows the improvement over Fig. 2 in r_s and percentage correct statistics after the rule base is applied (mean absolute differences are also shown in Fig. S3). Only 2.5% of gauges still have poor correlation and percentage correct statistics (< 0.8) when compared to the CEH-GEAR daily time series. On further investigation, these gauges are characterised by rainfall values of a reasonable order of magnitude but with long periods of missing data suggesting that the gauge may be faulty over a prolonged period. The absence of particularly high values in these gauges means that the QC tests have not flagged data as suspicious, and therefore such potential discrepancies are best identified through comparison to a high quality reference dataset (e.g. the CEH-GEAR gridded dataset, check gauges etc.). Such gauges would be excluded from some climatological analyses as a consequence of the large percentage of missing data though this is not an important factor for the production of the gridded dataset as the subsequent interpolation procedure (Section 6) accounts for missing periods by using the next nearest gauge. However, as a precaution against using such potentially erroneous data, only gauges with r_s > 0.8 and P00 + P11 > 0.8 after the QC process were used.
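The final screening step can be sketched as follows. This is an illustration only: it assumes that r_s is Spearman's rank correlation between the gauge series accumulated to daily totals and the reference daily series, that P00 + P11 is the proportion of days on which both series agree on dry or wet, and that a wet day is any value above 0 mm. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation (Pearson correlation of the ranks)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # ASSUMPTION: treat a constant series as uncorrelated
    return cov / sqrt(var_x * var_y)


def fraction_correct(gauge_daily, ref_daily, wet_mm=0.0):
    """P00 + P11: share of days where both series are dry or both are wet."""
    agree = sum((g > wet_mm) == (r > wet_mm)
                for g, r in zip(gauge_daily, ref_daily))
    return agree / len(gauge_daily)


def keep_gauge(gauge_daily, ref_daily, threshold=0.8):
    """Retain a gauge only if both statistics exceed the 0.8 threshold."""
    return (spearman(gauge_daily, ref_daily) > threshold
            and fraction_correct(gauge_daily, ref_daily) > threshold)


# Made-up daily totals (mm) for illustration.
reference = [0.0, 0.0, 1.0, 4.5, 0.0, 0.6, 0.0, 9.8]
good_gauge = [0.0, 0.0, 1.2, 5.0, 0.0, 0.4, 0.0, 10.1]
bad_gauge = [5.0, 3.0, 0.0, 0.0, 2.0, 0.0, 4.0, 0.0]
```

In this sketch a gauge with plausible magnitudes but the wrong timing, such as `bad_gauge`, fails both tests even though none of its individual values would look suspicious in isolation.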
There are a range of rule bases that could be used to determine the exclusion of potentially suspect data flagged by the QC tests outlined in this paper. We therefore examined three other rule bases made up of fewer rules and representing differing levels of 'severity'. Each of these was validated using the process described above (see Supplementary Information, Tables S3 and S4). The selected rule base described in this paper was found to eliminate a relatively large amount of suspect data whilst simultaneously eliminating only a small amount of non-suspect data. It is difficult to quantitatively demonstrate which rule base is 'better' as we do not have a reliable true reference dataset. The rule base presented here representatively codifies the judgments made in manual inspection and was therefore considered to be the most appropriate.
Fig. 6. Schematic of different dry spell event types highlighted by the comparison of the quality controlled hourly data (aggregated to daily) to the gridded daily data from the UKCP09 5 km gridded daily dataset as part of the validation process. Type D events are defined as periods that are dry in the gauge data and for which the UKCP09 gridded dataset has a wet-day percentage value of ≤20% or an average daily rainfall of ≤1 mm. Type E events (likely erroneous), are defined as periods that are dry in the gauge data but for which the UKCP09 gridded dataset has a wet-day percentage value of > 20% or an average daily rainfall of > 1 mm.

Temporal disaggregation
The quality controlled hourly gauge data was used to disaggregate the CEH-GEAR gridded daily rainfall dataset. There are many different interpolation methods available for gridding rainfall data, such as Thiessen polygons, inverse distance weighting, cubic splines and kriging. Many studies have compared the relative benefits of each (Contractor et al., 2015; Dunn et al., 2014; Hofstra et al., 2008; Dirks et al., 1998) and conclude that in areas of high gauge network density, the method selected has little impact. Given the relatively high rainfall gauge network density in the UK, a simple nearest neighbour interpolation without height correction was used to preserve a real storm shape for every grid square. This was considered to be beneficial as it will preserve extreme hourly rainfall intensities whereas other interpolation methods will smooth these extremes out. A limitation of this approach is that convective events can be very small and therefore nearest neighbour interpolation may sometimes represent a convective storm over too large an area. However, as it is the hourly rainfall fractions that are interpolated here, the actual rainfall total is modulated by the daily rainfall dataset, which is smoothed, meaning that this effect is reduced. This methodological choice is supported by similar applications in Li et al. (2018) and Choi et al. (2008).
For each day a subset of hourly gauges was selected, using only those where the record for that day was complete. The number of gauges used on each day was therefore variable as some were excluded because they contained missing data. Therefore, the number of gauges used on a given day ranged from 295 to 1372 (see Fig. 9). For each hour, the gauge data was interpolated onto a 1 km grid using nearest neighbour interpolation that, for each grid square, assigns it the nearest station hourly rainfall value as a fraction of the station's daily total (Isaaks and Srivastava, 1989). If the nearest station was over 50 km away, or if there was rain in the daily dataset but not in the nearest hourly gauge, the station value was not used and an average storm shape was used instead. The gridded hourly fractions were then multiplied by the daily rainfall from the CEH-GEAR dataset. This preserved the daily totals from CEH-GEAR whilst maintaining the nearest recorded storm shape. Fig. 10 shows the average storm shape given a daily rainfall total range, which changes according to season, and was constructed from all the available gauges (see Fig. S2 for the full distribution of storm profiles).
Fig. 7. For each individual dry spell event in the hourly record longer than 20 days, the corresponding time series is found in the UKCP09 gridded dataset. The percentage of days for which it is raining in the gridded dataset (left) and the mean daily rainfall in the gridded dataset (right) for each period is plotted for events excluded by the rule base and not excluded by the rule base.
Fig. 8. Comparison of r_s values between each hourly rain gauge (accumulated to 24 h) and the corresponding CEH-GEAR daily time series before and after implementation of the QC process (left), and for P00 + P11 (percentage correct) statistics before and after implementation of QC process (right).
In winter (November-April) the storms are typically longer in duration and less intense whilst in summer (May-October) they are shorter and more intense. This 'average storm' is set to begin in the gridded dataset at 0900 whenever it is applied. This represents a more realistic approach to disaggregation than has been used elsewhere, when accumulated rainfall totals have been distributed equally across the 24 h period (e.g. Parkes et al., 2013) and could potentially be improved upon by setting the peak of the design storm to coincide with a seasonal likely wettest hour. The design storms could also be calculated regionally, or alternatively a weather generator could be used to infill missing data.
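The disaggregation of a single grid square on a single day can be sketched as below. This is a simplified illustration under stated assumptions, not the code used to produce the dataset: gauge positions are given as planar coordinates in km, the list of gauges contains only those with a complete record for the day, and the average storm is passed in as a ready-made profile of 24 hourly fractions (in the paper this profile depends on season and daily total range). Index 0 is taken to be the hour beginning at 0900. All names are hypothetical.

```python
from math import hypot

MAX_GAUGE_DISTANCE_KM = 50.0  # distance limit stated in the text


def hourly_fractions(gauge_hourly_mm):
    """Hourly values as fractions of the gauge daily total, or None if dry."""
    total = sum(gauge_hourly_mm)
    if total <= 0:
        return None
    return [h / total for h in gauge_hourly_mm]


def disaggregate_cell(daily_total_mm, cell_xy_km, gauges, average_storm):
    """Split one grid-square daily total into 24 hourly values.

    gauges: list of (x_km, y_km, hourly_mm) tuples for gauges with a
    complete record on this day (hourly_mm has 24 values).
    average_storm: 24 fractions summing to 1, used as the fallback profile.
    """
    if daily_total_mm <= 0:
        return [0.0] * 24

    def distance(gauge):
        return hypot(gauge[0] - cell_xy_km[0], gauge[1] - cell_xy_km[1])

    fractions = None
    if gauges:
        nearest = min(gauges, key=distance)
        if distance(nearest) <= MAX_GAUGE_DISTANCE_KM:
            # None here means the nearest gauge was dry although the
            # daily grid has rain, so the fallback profile is used.
            fractions = hourly_fractions(nearest[2])
    if fractions is None:
        fractions = average_storm
    # Scaling the fractions by the daily grid value preserves the daily total.
    return [daily_total_mm * f for f in fractions]


# Made-up inputs for illustration.
storm = [0.25, 0.5, 0.25] + [0.0] * 21
wet_gauge = (0.0, 0.0, [0.0] * 3 + [2.0, 2.0] + [0.0] * 19)
dry_gauge = (0.0, 0.0, [0.0] * 24)
near = disaggregate_cell(10.0, (3.0, 4.0), [wet_gauge], storm)
far = disaggregate_cell(8.0, (100.0, 0.0), [wet_gauge], storm)
dry = disaggregate_cell(8.0, (3.0, 4.0), [dry_gauge], storm)
```

Because only the fractions come from the gauge, the hourly values in each grid square always sum to the daily gridded total, whichever profile is used.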

Reliability of the gridded hourly dataset
The reliability of CEH-GEAR1hr is dependent upon that of the daily totals from CEH-GEAR (see Keller et al. 2015), and of the hourly disaggregations. The disaggregation error has several components:  measurement error of the gauge record, error associated with the distance to the nearest gauge and the error associated with using statistical disaggregation. The measurement error has been reduced as much as possible through QC; however, some errors will remain in the gauge data, particularly those associated with wind-induced under-catch and evaporation errors. The metadata associated with the dataset includes the distance to the gauge used for each grid square for every day. The mean distance to a gauge over the whole period is 11.3 km and the maximum distance is 97.7 km for the west coast of Scotland. Most of the country is reasonably well covered by rain gauges, with the exception of Scotland and the south west of England, which is reflected in these areas having a higher average distance to gauge (Fig. 11). In the case of the latter, this is likely a result of the later instrumentation with gauges by the Environment Agency in south west England compared with other parts of the country. Fig. 12 shows the increasing error with distance between the temporal patterns (the fraction of daily rainfall falling in each hour) of a gauge and its nearest neighbour. This information, together with the distance to the gauge used for disaggregation provided with the hourly rainfall grids, gives users the information to decide if parts of the CEH-GEAR1hr estimates are suitable for their needs. The low gauge density in some areas also means that errors are likely to occur from under-sampling of orographically enhanced rainfall and localised convective storms.
The metadata also describes whether or not statistical disaggregation was used for each day. As noted above, where a grid square is greater than 50 km from a gauge, or when there is rainfall in the daily record but not in the hourly record, statistical disaggregation is used. Statistical disaggregation due to zero rainfall in the hourly record is used for 0-26% of grid squares on any given day over the whole 1990-2014 period, whereas statistical disaggregation due to the grid square being over 50 km away from a gauge was used for 0-3% of grid squares over this time period. Instances of the latter use of statistical disaggregation are rare after 2000 when the final regions of the EA network were gauged and so the average distance between gauges in data sparse areas decreased. Whilst the percentage of grid squares using statistical disaggregation due to zero rainfall in the hourly record fluctuates greatly over the record, 97% of the time this is used to disaggregate a daily total of less than 1 mm of rainfall. The statistical disaggregation is therefore unlikely to have a large impact on extreme values in the dataset or on subsequent hydrological model simulations.

Evaluation of the hourly gridded dataset
CEH-GEAR1hr has statistical properties of interest that are generally consistent with previous gauge-derived national scale estimates (Blenkinsop et al., 2017). Throughout most of the year the median of the seasonal maximum 1 h rainfall (Rmax) is highest in the mountainous regions and the west of Britain (Fig. 13) whilst in the east and lowland areas, Rmax is highest in summer (JJA) and autumn (SON). The spatial pattern of Rmax is also much less coherent and less clearly defined by orography in summer, which might be expected from the increased dominance of convective rainfall at this time of year and is consistent with the corresponding results presented in Blenkinsop et al. (2017). Summer also sees the lowest number of wet hours across the country and winter the greatest (Fig. 14), with the influence of topography on this variable evident throughout the year. These patterns are, as expected, broadly consistent with those of daily rainfall occurrence rates (Jenkins et al., 2008). By using both gridded daily data and hourly gauges, the new dataset provides additional spatial detail not present in previous maps. A more detailed evaluation of the dataset is beyond the scope of this paper but will be provided in a subsequent publication.

Discussion and conclusions
A 1 km gridded hourly rainfall dataset has been created for Great Britain using data from over 1900 quality controlled gauges for the period 1990-2014. We extended and automated a quality control procedure (essential for such a large number of gauges) to expand the availability of the hourly rain gauge dataset accumulated by Blenkinsop et al. (2017). We allocated a series of QC flags that identified potentially suspect values and then constructed a set of rules that apply either single or combined QC flags based on knowledge gained from the initial dataset to exclude likely erroneous rainfall amounts, accumulations and dry spells. A two-tiered validation approach at the daily and hourly timescales indicates that spurious extreme values have been excluded from the resultant dataset, while assumed legitimate values have been preserved. Most quality-controlled gauges are found to be highly correlated with the UKCP09 observed 5 km dataset at the daily timescale and show a high degree of concordance in terms of rainfall occurrence. The resulting gauge density used in this product varies over time and space, being particularly sparse in Scotland and the southwest of England prior to 2000, and this should be borne in mind when using the dataset.
A nearest neighbour interpolation scheme was used to provide gridded hourly rainfall values at a resolution of 1 km. These data were then used to temporally disaggregate the existing CEH-GEAR daily gridded dataset to produce an hourly dataset with consistent daily totals. Consistency with the existing CEH-GEAR dataset is an important feature of the new hourly dataset as it will be made freely available, hosted and updated alongside the CEH-GEAR product at http://eidc.ceh.ac.uk.
This new dataset will be a valuable resource for hydrologists, climate scientists and the broader community wishing to assess current exposure to intense rainfall. There are few national datasets available at a sub-daily time-step yet many hydrological applications require, or can be improved by, a higher temporal resolution of rainfall data especially for smaller, rapidly responding, catchments (e.g. Archer et al., 2016). However, we would recommend that the gridded hourly data should not be used for trend analyses due to the short length of the dataset, potential gauge level inhomogeneities and the temporal variation in gauges used in the disaggregation. Kendon et al. (2018) examined UK gauges of at least 13 years duration and subsequently corrected for inhomogeneities arising from changes in measurement resolution. This was found to have a limited effect on extreme values but can have a significant impact on mean intensities and rainfall occurrence statistics. The lack of additional metadata available makes the identification and attribution of other potential inhomogeneities challenging.  If only gauge data is required, the QC procedure presented here can be used to improve the consistency of data by trying to eliminate erroneous extremes whilst maintaining genuine values. The QC framework here will therefore also be implemented in the global effort to gather sub-daily rainfall data currently being undertaken by the INTENSE (INTElligent use of climate models for adaptatioN to non-Stationary hydrological Extremes) project which forms part of the World Climate Research Programme's Global Energy and Water EXchanges (GEWEX) Grand Challenge on Extremes. This has identified the importance of developing new and novel QC methods at different timescales and for different locations (Alexander et al., 2016). A key challenge will be to adapt the tests and rules developed here so that they are applicable to different climate regimes and operating practices.
A by-product from the generation of the gridded dataset is an additional dataset comprising the individual QC flags (Table 1) associated with each hourly value for each gauge. This metadata can be obtained alongside the rule base code to allow users who are licensed to use the original gauge data to apply a bespoke rule base to the gauge data. This creates a community resource where improvements to the database can be made and shared for different types of analyses. One future piece of work could be to explore the uncertainty associated with the implementation of the rule base or, more generally, the quantification of uncertainty within the gridded hourly precipitation product. The importance of adequately assessing uncertainty within model predictions and observations that drive them has been highlighted by a vast array of authors (e.g. Beven, 2002; Pappenberger et al., 2006, 2008; Di Baldassarre et al., 2010; McMillan et al., 2012). Within the field of hydrology, the rainfall data used to force a model prediction is a key source of error, and therefore, the generation of a probabilistic rainfall product would be a valuable next step (Ahrens and Jaun, 2007; Sideris et al., 2014). Corresponding products are currently being developed by the reanalysis community (Bach et al., 2016). Some key prerequisites for generating a probabilistic rainfall product would be to gain a better understanding of the uncertainties associated with the hourly gauge data (for example undercatch), the QC process and the interpolation method, but ultimately this would also need to be explored at the daily level in the original CEH-GEAR product.
Further potential exists for novel methods to create merged products that take advantage of the spatial detail offered by other sources of data such as radar. Jewell and Gaussiat (2015) examined several schemes to merge radar with TBR data over England and Wales, all of which produced a merged product that was superior to the individual data sources. Such products have significant potential to support flood forecasting and may provide improved calibration of hydrological models (Parkes et al., 2013). As a next step, we intend to provide added value to the dataset by conducting an assessment of its key characteristics with respect to flooding from intense rainfall and developing freely available gridded datasets of useful indices such as extreme percentiles and peaks over threshold as well as the provision of Intensity-Duration-Frequency curves. We will also evaluate the benefit of using the hourly gridded dataset in hydrological modelling applications.

Declarations of interest
None.