The International Surface Pressure Databank version 2

The International Surface Pressure Databank (ISPD) is the world's largest collection of global surface and sea‐level pressure observations. It was developed by extracting observations from established international archives, through international cooperation with data recovery facilitated by the Atmospheric Circulation Reconstructions over the Earth (ACRE) initiative, and directly by contributing universities, organizations, and countries. The dataset period is currently 1768–2012 and consists of three data components: observations from land stations, marine observing systems, and tropical cyclone best track pressure reports. Version 2 of the ISPD (ISPDv2) was created to be observational input for the Twentieth Century Reanalysis Project (20CR) and contains the quality control and assimilation feedback metadata from the 20CR. Since then, it has been used for various general climate and weather studies, and an updated version 3 (ISPDv3) has been used in the ERA‐20C reanalysis in connection with the European Reanalysis of Global Climate Observations project (ERA‐CLIM). The focus of this paper is on the ISPDv2 and the inclusion of the 20CR feedback metadata. The Research Data Archive at the National Center for Atmospheric Research provides data collection and access for the ISPDv2, and will provide access to future versions.


Introduction
The International Surface Pressure Databank (ISPD) is the world's largest collection of global surface and sea-level pressure observations. Since its inception in 2002, its development has been facilitated by cooperative efforts between the Global Climate Observing System (GCOS)/World Climate Research Program (WCRP) Working Group on Observational Data Sets for Reanalysis from 2007 to 2011, and the ongoing efforts of the GCOS Working Group on Surface Pressure and the international Atmospheric Circulation Reconstructions over the Earth initiative (ACRE, Allan et al., 2011). The full observational record is extracted and assembled from established international archives and newly available collections from more than 60 different contributing organizations, shown in Tables 1 and 2.
The ISPD version 2 (ISPDv2) covers the period 1768-2012. It merges data from three input components: observations from land stations, marine observing systems, and tropical cyclone best track pressure reports. The land station component was extracted from many national and international collections of sea-level pressure and surface pressure. The largest contributor to this component was the Integrated Surface Database, which consists of global hourly and synoptic surface observations collected from many sources. The land stations were merged following a two-step algorithm to first remove duplicates within a collection and then remove duplicates between collections (Yin et al., 2008).
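The two-step merge can be pictured with a minimal sketch. The record fields, the matching keys, and the rounding tolerances below are illustrative assumptions and are not the criteria of Yin et al. (2008); the sketch only shows the order of operations, with duplicates removed within each collection first and between collections second.

```python
from typing import NamedTuple


class Obs(NamedTuple):
    collection: str      # hypothetical collection ID
    station: str         # hypothetical station identifier (collection-specific)
    lat: float
    lon: float
    time: str            # observation time, e.g. "1900-01-01T06:00"
    pressure_hpa: float


def dedupe(records, key):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    kept = []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            kept.append(rec)
    return kept


def merge_collections(records):
    # Step 1: within a collection, station identifiers are consistent, so
    # (collection, station, time) is assumed to identify a duplicate.
    within = dedupe(records, key=lambda r: (r.collection, r.station, r.time))
    # Step 2: between collections, identifiers may differ, so this sketch
    # assumes a match on rounded position, time, and rounded pressure.
    return dedupe(
        within,
        key=lambda r: (round(r.lat, 1), round(r.lon, 1), r.time,
                       round(r.pressure_hpa, 1)),
    )
```

In this sketch the first collection encountered wins when two collections report the same observation; a real merge would need an explicit priority order among collections.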
The marine component consists of sea-level pressure observations extracted from the International Comprehensive Ocean-Atmosphere Data Set (ICOADS, Woodruff et al., 1998; Parker et al., 2004; Woodruff et al., 2005; Worley et al., 2005; Woodruff et al., 2011), a global ocean marine meteorological and surface ocean dataset that is considered the most complete of its kind. It comprises measurements and observations extracted from many international data sources, including ship reports, moored and drifting buoys, coastal stations, and other marine platforms. ICOADS release 2.4 was used in the ISPDv2 for the period 1952-2008 and ICOADS release 2.5 was used for the periods 1785-1951 and 2009-2012. The tropical cyclone component was taken from the International Best Track Archive for Climate Stewardship (IBTrACS; Knapp et al., 2010). The IBTrACS dataset consists of global tropical cyclone best track position and intensity observations, and reports collected from each of the World Meteorological Organization (WMO) Regional Specialized Meteorological Centers, Tropical Cyclone Warning Centers, and other national agencies. The IBTrACS Beta version was used for the years 1952-2006, version v01r01 for 1871-1883 and 1886-1951, version v02r01 for 1884-1885 and 2007-2008, version v03r02 for 2009-2010, and version v03r05 for 2011-2012. The inclusion of various versions of both ICOADS and IBTrACS in ISPDv2 was the result of using the most up-to-date data that were available during the assimilation into the Twentieth Century Reanalysis (20CR; Compo et al., 2011). This rationale is elaborated upon in Section 2.1.
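Because the source version varies by year, it can help to encode the year ranges stated above as a lookup. The following is a small sketch; the table and function names are hypothetical, and only the year ranges and version labels come from the text.

```python
# (first_year, last_year, version), transcribed from the ranges in the text.
ICOADS_VERSIONS = [
    (1785, 1951, "2.5"),
    (1952, 2008, "2.4"),
    (2009, 2012, "2.5"),
]

IBTRACS_VERSIONS = [
    (1871, 1883, "v01r01"),
    (1884, 1885, "v02r01"),
    (1886, 1951, "v01r01"),
    (1952, 2006, "Beta"),
    (2007, 2008, "v02r01"),
    (2009, 2010, "v03r02"),
    (2011, 2012, "v03r05"),
]


def source_version(year, table):
    """Return the source version used for a given year, or None if the
    year falls outside every range in the table."""
    for first, last, version in table:
        if first <= year <= last:
            return version
    return None


print(source_version(1900, ICOADS_VERSIONS))   # 2.5
print(source_version(1960, IBTRACS_VERSIONS))  # Beta
```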
The columns denote, respectively, the ISPDv2 collection ID, collection name, description, yearly period of record, total number of stations and observational records in the collection, reference number for the corresponding National Climatic Data Center (NCDC) dataset, reference number for the corresponding dataset archived in the Research Data Archive (RDA) at the National Center for Atmospheric Research (NCAR), and reference publication. *The Air Weather Service TD13 observational records originating from collection ID 1004 are incorrectly labeled with collection ID 2001 in the native HDF5 tables in both ISPDv2 and ISPD version 3.
The international surface pressure databank

Dataset content and coverage
The ISPDv2 dataset is stored in the Hierarchical Data Format version 5 (HDF5: http://www.hdfgroup.org), which allows for efficient archiving of large, complex data structures and a diverse set of metadata. Each HDF5 file contains 13 tables and 15 directory subgroups that define how to decode the individual records, and each observation record includes metadata defining the data collection source and observation type (e.g. station observation, marine observation, surface pressure observation from a radiosonde sounding).
Because the ISPDv2 was assimilated as observational input into the 20CR, each record also includes the 20CR quality control and assimilation feedback metadata. These quality control and feedback metadata include the results of the five-step quality control procedure described in Appendix B of Compo et al. (2011) and all the relevant statistical quantities returned by the 20CR ensemble data assimilation system. The format of the HDF5 metadata, therefore, is designed to allow traceability of observations from their original source to the ISPDv2, and to permit direct feedback from the 20CR data assimilation back to the original source archives.
Each observation at any single observation time is assigned a unique number that when combined with the time stamp (year, month, date, hour, and minute) forms a unique identifier within the complete dataset. Thus, every observation has an unambiguous reference within the full ISPDv2 dataset, which allows for precise identification and traceability. This data management feature is critical for accurately adding additional metadata from other future reanalyses, and evolving the ISPDv2 to new versions while maintaining the usage provenance of each record.
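The identifier scheme described above amounts to a composite key. The sketch below illustrates the idea only; the class name, field names, and string layout are assumptions and do not reproduce the actual HDF5 field definitions, which are described in the dataset documentation.

```python
from typing import NamedTuple


class ObsKey(NamedTuple):
    """Time stamp plus a per-time-stamp observation number (hypothetical names)."""
    year: int
    month: int
    day: int
    hour: int
    minute: int
    number: int

    def as_string(self) -> str:
        # Illustrative layout: YYYYMMDDHHMM followed by the observation number.
        return (f"{self.year:04d}{self.month:02d}{self.day:02d}"
                f"{self.hour:02d}{self.minute:02d}-{self.number}")


# Feedback from several reanalyses can then be attached to the same record
# without ambiguity, because the key is unique within the full dataset.
feedback = {}
key = ObsKey(1900, 1, 2, 6, 0, 17)
feedback.setdefault(key, {})["20CR"] = {"first_guess_departure_hpa": 0.8}
feedback.setdefault(key, {})["other_reanalysis"] = {"first_guess_departure_hpa": 0.5}

print(key.as_string())  # 190001020600-17
```

The departure values in this usage example are made up for illustration.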
The 20CR required hourly files from the ISPDv2 data for the data assimilation system, thus each data file in the ISPDv2 collection contains one hour of global observations. Based on this file organization and data coverage, a maximum of 24 data files per day and 8760 data files per (non-leap) year exist in the ISPDv2 collection. The total dataset volume is 491.32 GB. The HDF5 data are available for download as either individual hourly data files, or as yearly tar files, which are tarballs of one entire year of HDF5 data files.
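The file counts quoted above follow directly from the one-file-per-hour organization. A short check, using a hypothetical helper name, is:

```python
import calendar


def max_hourly_files(year):
    """Upper bound on the number of hourly files in a given year."""
    days = 366 if calendar.isleap(year) else 365
    return 24 * days


print(max_hourly_files(1999))  # 8760 (non-leap year)
print(max_hourly_files(2000))  # 8784 (leap year)
```

These are upper bounds; whether every hour of a given year actually has a file depends on the data coverage for that year.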
The ISPDv2 is also available in NetCDF and ASCII column data formats, both of which are derived from the native HDF5 format. These two formats do not completely reproduce the table content and metadata contained within the HDF5 data. They do contain, however, the essential information that most users need to carry out their research projects, including the 20CR quality control feedback information. The ASCII data are available as monthly tar files for direct download from the National Center for Atmospheric Research (NCAR) Research Data Archive (RDA) website (http://dx.doi.org/10.5065/D6SQ8XDW), while the NetCDF data are produced on demand through the data subsetting service described in Section 3.1.
The data availability varies considerably over the period of record. Maps of annual land station distribution (Figure 1) highlight the steady growth in the number of reporting stations, and hence data coverage, since the year 1850 (station coverage maps for every year are available at http://www.esrl.noaa.gov/psd/data/ISPD/v2.0/). The sources of land station observations in the 19th and early 20th centuries are located primarily in Europe and North America. The data coverage increases dramatically across most of the rest of the globe from 1950 to the present. Figure 2 shows time series of the total number of land observing stations in each indicated continental region by year from 1850 to 2010. The notable decline in stations during the mid-1960s, primarily in the Asia/Eastern Europe time series, is attributed to the reduction in the Integrated Surface Database records during this period, and is explained in Smith et al. (2011) as the result of the transition from keying of observational data to digital transmission and receipt of data.
The Asia time series also exhibits a considerable decline in station coverage starting in the mid-1980s. A listing of the number of stations originating from the Integrated Surface Database for the years 1963-1964 and selected years between 1980 and 2001 is shown in Table 3, which reveals that these reductions are primarily associated with the observations originating from the Russian Federation and other former Soviet Union countries. These reductions can be explained by the deterioration of funding for the meteorological observation network in the final years of the Soviet Union's existence and thereafter in the 1990s and 2000s. The observation network health during this post-Soviet period was a function of the economic situation in each of the 15 newly independent nations (and, to some extent, a function of foreign sponsorship). The China, North and Central America, and South-West Pacific regions also exhibit notable decreases between 1963 and 1964.
The average number of pressure observations per day contained in the 20CR feedback records from all ISPDv2 components (station, marine, and tropical cyclone) is illustrated by maps for the years 1900, 1950, and 2000 in Figure 3. The preponderance of observations in the Northern Hemisphere in earlier years is evident, as is the growth of marine observations over time (see Woodruff et al., 2011 for a complete description).
The total number of observations per year in the ISPDv2 increases steadily from approximately 100 000 records in 1870 to 53 million in 2010 (Figure 4). A local increase and subsequent decrease in this trend during the 1878-1894 period is attributed to the addition of 1.8 million records from the US Marine Meteorological Journals that were digitized for ICOADS Release 2.5. A decrease during the World War I years is evident, while an increase during the World War II years is attributed to the inclusion of 1.6 million ICOADS records digitized from the UK Royal Navy Ships' logbooks (Brohan et al., 2009; Woodruff et al., 2011). A decrease during the mid-1960s is associated with the aforementioned decline in station records in the Integrated Surface Database. A decrease in 2007 is most likely due to a delay in processing, and therefore incomplete records, in the Integrated Surface Database. Other small-scale variations in the curve likely can be correlated with increased scientific interest (local peaks) and research budget reductions (local troughs; Fleming, 2000).
The dataset contains reports from more than 10 000 land stations, which, in addition to the marine observing systems reports, comprise over 1.5 billion observations in total. Between 1860 and 1917, the total number of marine and surface station observations is comparable, but in later years the total count is dominated by land surface station reports (not shown). This prevalence of land station observations is primarily due to the input data compiled for the National Centers for Environmental Prediction (NCEP)/NCAR Reanalysis project (1948-2003; Kalnay et al., 1996; Kistler et al., 2001).

Reanalysis
The primary motivation for the creation and development of the ISPD was to facilitate progress on research and understanding of long-term trends and variations in global surface pressure. In particular, a long-term historical dataset, such as the ISPD, becomes essential to providing an observational underpinning to retrospective climate analysis datasets, commonly known as reanalyses. Reanalysis products are used extensively in climate research, applications, and services, including for monitoring and comparing current climate conditions with those of the past, identifying the causes of climate variations and change, and preparing climate predictions. Information derived from reanalyses is also being used increasingly in commercial and business applications in sectors such as energy, agriculture, water resources, and insurance.

Table 3. The columns represent the station count for the years listed and organized by WMO regional block number (0 and 1 = Europe; 2 and 3 = Russian Federation and other former Soviet Union countries; 4 = Asia; 5 = China; 6 = Africa; 7 = North America, Central America, and the Caribbean; 8 = South America and Antarctica; and 9 = South-West Pacific).

Block   1963   1964   Selected years between 1980 and 2001
0         47     28    877   1007    925    927    869   1058
1        184    166    717    813    740    748    777    935
2        914    223    800    695    683    577    410    482
3        795    124   1031    878    835    751    607    584
4        466    353    916    932    864    876    867   1049
5        518    347    727    598    592    587    558    638
6        211    222    689    768    758    776    825    887
7        484    280   1524   1594   1671   1697   1560   1770
8        132    132    550    695    661    684    630    671
9        689    626    578    865    874    898    860    841
Prior to completion of the ISPD, most reanalysis products (including those from the NCEP/NCAR Reanalysis Project (Kalnay et al., 1996; Kistler et al., 2001) and the European Centre for Medium-Range Weather Forecasts [ECMWF; Uppala et al., 2005]) could only extend back to about 1950 since they relied on assimilating a set of observations that included upper-level atmospheric information. The primary input data source for the NCEP/NCAR Reanalysis, for example, consists of the global rawinsonde observations, whose period of record is substantial enough to produce that reanalysis back to 1948. The limited time range of these reanalyses restricts their usefulness for many climate research applications. The ISPD adds value by providing a much longer period of record and thus enables the development of reanalyses with longer time spans.
The first reanalysis to make use of the ISPD is the 20CR Project, which assimilates only the ISPDv2 surface and sea-level pressure observations and prescribes observed monthly sea-surface temperature and sea-ice distributions from HadISST1.1 (Rayner et al., 2003) as boundary conditions. This global reanalysis dataset spans the late 19th century, the entire 20th century, and the early 21st century. It estimates the state of the atmosphere (the 'analysis') by combining the hourly and synoptic ISPDv2 observations in a 6-h time window with a dynamically generated 9-h first guess forecast initialized from the previous analysis. Cycling this procedure with overlapping 6-h analyses and 9-h forecasts has the effect of spreading the observational information from the ISPD three-dimensionally in space and in time.
As the 20CR was produced, the early period ISPDv2 'beta' version was revised to include newly digitized additions and improve the overall coverage of stations. As ICOADS Release 2.5 was then available, it was also included to increase the number of marine observations in the early period. Similarly for IBTrACS, newly digitized records from the most recent versions of IBTrACS were added to ISPDv2 as it evolved and improved over time. The final ISPDv2, therefore, contains multiple source versions of ICOADS and IBTrACS. The newer versions of ICOADS and IBTrACS did not change appreciably except to expand their data inputs and coverage; therefore, we do not anticipate that the use of multiple versions introduces inconsistencies in the pressure dataset.

Twentieth century reanalysis quality control and data assimilation feedback
During the data assimilation processing of the 20CR, the ISPDv2 input data were interrogated by a five-step quality control procedure that included, among other steps, checking the observations for meteorological plausibility, comparing them with neighbouring data values, and performing a bias correction on the land component (see Compo et al., 2011 for more information on this procedure). The 20CR then applied an Ensemble Kalman Filter data assimilation method which used background first-guess fields supplied by an ensemble of 56 forecasts from an experimental version of the NCEP Global Forecast System (GFS) numerical weather prediction model (Kanamitsu et al., 1991; Moorthi et al., 2001; Saha et al., 2006). These quality control and data assimilation results, which include the first-guess fields, analysis departures, bias estimates, and observation errors, were then written back into the ISPDv2 so that each observational record contains this feedback information. Future users of the ISPDv2, and those who contributed the original observational data, therefore can utilize this information to make an informed decision on the quality and usefulness of each observational record during the 20CR time period.
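To make the feedback quantities concrete, the following is a deliberately simplified sketch of an ensemble Kalman update for a single, directly observed pressure value. It is not the 20CR implementation, which updates a full three-dimensional model state from a 56-member ensemble; the function name, the three-member ensemble, and the numbers are illustrative assumptions. Departures are taken as ensemble mean minus observation.

```python
from statistics import mean, variance


def scalar_ensemble_update(first_guess, obs, obs_error_var):
    """Update the ensemble-mean pressure at one location with one observation.

    first_guess   : list of first-guess pressures (hPa), one per ensemble member
    obs           : observed pressure (hPa)
    obs_error_var : assumed observation error variance (hPa^2)
    """
    fg_mean = mean(first_guess)
    fg_var = variance(first_guess)            # sample variance of the ensemble
    gain = fg_var / (fg_var + obs_error_var)  # scalar Kalman gain
    an_mean = fg_mean + gain * (obs - fg_mean)
    return {
        "first_guess_departure": fg_mean - obs,
        "analysis_departure": an_mean - obs,
        "gain": gain,
        "analysis_mean": an_mean,
    }


result = scalar_ensemble_update([1000.0, 1002.0, 1004.0], obs=1000.0,
                                obs_error_var=4.0)
print(result["first_guess_departure"])  # 2.0
print(result["analysis_departure"])     # 1.0
```

With equal ensemble and observation error variances the gain is 0.5, so the analysis lies halfway between the first guess and the observation. A persistently large first-guess departure relative to the ensemble spread is the kind of signal a user might examine when judging an individual record.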
One project that utilized the 20CR feedback metadata is the ERA-20C reanalysis (Poli et al., 2013), which is the first reanalysis produced under the European Reanalysis of Global Climate Observations database project (ERA-CLIM: www.era-clim.eu). The ERA-20C is a global reanalysis that spans the 20th century (1900-2010), assimilating only surface and sea-level pressure observations from ISPD and ICOADS and marine surface winds from ICOADS. Feedback information from the 20CR was used in the bias correction scheme for ISPD data that were assimilated in the ERA-20C reanalysis; this was based on a break-point analysis using the 20CR first-guess departures (Hersbach et al., 2015). In locations where breakpoints were suspected (e.g. instances of instrument errors or a change in station location), the ERA-20C bias correction scheme assigned less confidence to the 20CR first-guess value, and the correction was thus allowed to be more adaptive compared to cases where no irregularities in long-term departures were identified.
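The idea behind a break-point analysis of first-guess departures can be illustrated with a very simple mean-shift search. This sketch is not the scheme of Hersbach et al. (2015); the function name, the minimum segment length, and the sample series are assumptions chosen only to show how a step change in long-term departures (for example after a station relocation) can be located.

```python
from statistics import mean


def find_breakpoint(departures, min_segment=3):
    """Return (index, shift) for the split that maximizes the absolute
    difference between the mean departure after and before the split.

    Returns (None, 0.0) if the series is too short to split.
    """
    best_index, best_shift = None, 0.0
    for i in range(min_segment, len(departures) - min_segment + 1):
        shift = mean(departures[i:]) - mean(departures[:i])
        if abs(shift) > abs(best_shift):
            best_index, best_shift = i, shift
    return best_index, best_shift


# Hypothetical departures (hPa): near zero at first, then offset by about 2 hPa.
series = [0.1, -0.1, 0.0, 0.1, -0.1, 2.0, 2.1, 1.9, 2.0]
index, shift = find_breakpoint(series)
print(index)  # 5
```

A practical scheme would also test whether the detected shift is statistically significant before treating it as a break point.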
In another study, Wang et al. (2014) used the 20CR feedback metadata to show that the strong extratropical cyclone events in the 20CR agree well with the geostrophic wind extremes derived from in situ pressure observations. This study illustrates how the 20CR quality control procedure performed well in identifying errors in the observational record. For example, the 20CR quality control system identified and rejected 143 of 146 erroneous observation values from the Aberdeen, Scotland records for the period 1871-1921. This quality control feedback information, therefore, is valuable in identifying observation errors which might otherwise be overlooked and, coupled with continued data rescue efforts such as those facilitated by ACRE, serves to strengthen and reduce uncertainty in the observational record.

Research data archive data location and accessibility
The primary repository for ISPDv2 is in the RDA (http://rda.ucar.edu) at NCAR in Boulder, Colorado. The RDA is a free and open data collection where data discovery can be achieved through faceted searches based on Global Change Master Directory (GCMD: http://gcmd.nasa.gov) metadata keywords, free text queries, and lists highlighting the most used datasets. Data access is free, but requires each user to register through a simple online process that validates the submitted email address.
The ISPDv2 data access is organized by year and month on the RDA website (http://dx.doi.org/10.5065/D6SQ8XDW), and users may browse through the pages to locate specific data of interest. Data files are organized in default groups by observation year and month. They can be downloaded directly from the RDA web interface using server-supplied 'wget' scripts. Users may select a collection of files to download from default lists by using the 'Web File Listing' option under the 'Data Access' tab.
A more refined option is provided through the data subset request form, which may be accessed via the 'Get a subset' link under the 'Data Access' tab. This produces a customized data subset of the HDF5 data based on user-provided constraints. Users can specify the temporal limits, spatial domain, observation type(s), and ASCII or NetCDF data output formats through the data subset request form (Figure 5). The output file compression can also be requested on this form (not shown). Observations from individual observing stations may also be requested, which is a useful feature for users who wish to procure time series for specific locations or regions. Once submitted, the data subset requests are produced by a delayed mode data processing procedure. Users are notified and directed to a web download location when their data request output files are accessible.

Dataset citation
The RDA also provides a data citation service. The dataset homepage provides citation syntax in the standard forms recommended by the Federation of Earth System Information Partners (ESIP), American Meteorological Society (AMS), American Geophysical Union (AGU), DataCite, and the Geoscience Data Journal. This dataset citation is also available in Research Information System (RIS) format so that users may import the citation for this dataset directly into their citation reference management software (e.g. EndNote, Zotero). One key element in the dataset citation is the data access date ('Accessed dd mmm yyyy'). Leveraging the fact that RDA data users are registered, a customized data citation can be prepared for the user by using the 'Get a customized data citation' link on the dataset home page. The dates that users access the data are recorded and can be retrieved on demand at later times.

File content metadata
File content metadata are collected as part of the data processing and preparation of the ISPDv2 for the RDA. These metadata support additional information services for interested users. The overall ISPDv2 metadata summary is provided from the dataset home page under the 'More Details' link. Here, tabulations of the observing station location, observation type, platform identification, and maps of the global distribution for the full ISPDv2 can be viewed (Figure 6).
In addition, users may view the content metadata for the yearly tar file archives by clicking on a looking glass icon adjacent to the tar file listing, accessible under the 'Data Access' tab. This service enables users to query the ISPDv2 via an interactive map of land station locations, and more efficiently determine the observational data available at any particular time and region of interest (Figure 7).

Supporting software and documentation
Source code to read and decode the HDF5 data into ASCII column output, written in C and developed at the University of Colorado CIRES and NOAA Earth System Research Laboratory, is provided on the RDA website under the 'Software' tab on the dataset home page. Documentation describing the HDF5 tables and ASCII output is also available for download under the 'Documentation' tab on the dataset home page.

ECMWF data location and accessibility
A subset of the ISPDv2 is also archived at ECMWF (http://apps.ecmwf.int/datasets). Here, users may extract temporal subsets from the ISPDv2 archive stored at ECMWF and refine their selection by observation platform. The output from these requests is written in ASCII column format and contains the date, location, observed value, report type, unique identifier which allows traceability to other ISPD versions, and several feedback parameters from the 20CR, including the 20CR bias estimate, the 20CR ensemble mean first guess pressure minus the 'modified' observed pressure, and the ensemble mean analysis pressure minus the 'modified' observed pressure. The 'modified' pressure is the pressure value after the 20CR system adjusted it to be consistent with the orography in the assimilating model. This is similar to a reduction to sea level but instead is an adjustment to the 20CR orography, which could be higher or lower than the station elevation. In addition to the ASCII output, users may produce observation count maps for their data selection on the ECMWF web interface.

Future plans
Since the release of ISPDv2, over 22 additional organizations have contributed to the station and marine components of the ISPD (Table 1) and the period of record now extends back to 1755. These new contributions have been incorporated into a newer version 3 of the ISPD (ISPDv3), which will be made available in the near future through the NCAR RDA. The observation feedback archive (OFA) version of this dataset as used in the ERA-20C reanalysis is available from ECMWF (Hersbach et al., 2015).
The gain in observations for ISPDv3 is illustrated by the time series shown in Figure 8. Most of the increase occurs in the Northern Hemisphere, and there are visible increases in the late 19th century and during the World War I years. The 2007 decrease in merged records from the Integrated Surface Database into ISPDv2 (cf. Figure 4) is recovered in ISPDv3. These and increases in other years come from diverse station and marine collections contributed to the ISPD under the auspices of GCOS, WCRP, and the ACRE initiative. This includes the initial efforts of the Old-Weather.org citizen science project that focused on UK [...] Brunet et al., 2014). The merging of these contributions results in increases that are particularly dramatic when viewed as a map, as shown for the example year of 1918 (Figure 9). The increases over eastern South America, eastern Africa, China, New Zealand, and selected central Pacific islands are especially evident.

Figure 5. The data subsetting interface found on the ISPDv2 Research Data Archive webpage after user authentication. Users choose temporal range, spatial selection (choices are interactive map functions, manual latitude/longitude entry, or station identifier entry), and also can select one or many observation types, and ASCII or NetCDF output data formats.
ISPDv3 (specifically, version 3.2.6) was implemented into the European Reanalysis of Global Climate Observations database (ERA-CLIM: www.era-clim.eu), with a goal of using these and additional observations in an ECMWF data assimilation system to generate new global climate reanalyses of the 20th century. The ERA-20C global reanalysis, the first reanalysis under this project, spans the 20th century (1900-2010), assimilating only surface and sea-level pressure observations from ISPDv3 and marine wind observations from ICOADS 2.5 using a 4D-Var assimilation system (Poli et al., 2013).
In addition, ISPD version 3.2.9 is being assimilated into the next generation of the 20CR (version 2c). 20CRv2c uses the same NCEP atmosphere/land coupled model at the same spectral resolution of total wavenumber 62 (about 200 km by 200 km) as 20CRv2 but has different prescribed boundary conditions. Daily sea surface temperatures come from a new Simple Ocean Data Assimilation with sparse input version 2 (SODAsi.2, B.S. Giese, H.F. Seidel, G.P. Compo and P.D. Sardeshmukh, in review). Prescribed monthly averaged sea ice concentrations from the new COBE-SST2 (Hirahara et al., 2014) correct the known misspecification in 20CRv2 (Brönnimann et al., 2012). This activity is currently underway and will be reported on in due course.
The ISPD also will be used in the planned centennial coupled reanalysis being developed under the Japa-

Comments and questions on the ISPD can be directed to the RDA data specialist or left at the reanalyses.org webpage for this dataset (http://reanalyses.org/observations/international-surface-pressure-databank).

vided data to EMULATE. (6) Ingeborg Auer of Zentralanstalt für Meteorologie und Geodynamik (ZAMG) who originally provided Austrian data to EMULATE. (7) Clive Wilkinson of NCDC and Climatic Research Unit, University of East Anglia, Norwich, UK, and Eric Freeman of NCDC for World War II UK Royal Navy Data.

Figure 3). Counts of observations are made in five-degree grid boxes.