The PRIMAP-hist national historical emissions time series

To assess the history of greenhouse gas emissions and individual countries’ contributions to emissions and climate change, detailed historical data is needed. We combine several published datasets to create a comprehensive set of emission pathways for each country and Kyoto gas covering the years 1850 to 2014 with yearly values for all UNFCCC member states as well as most non-UNFCCC territories. The sectoral resolution is that of the main IPCC 1996 categories. Additional subsectors are available for time series of CO2 from energy and industry. Country resolved data is combined from different sources and supplemented using year to year growth rates from region resolved sources and numerical extrapolations to complete the dataset. Regional deforestation emissions are downscaled to country level using estimates of the deforested area obtained from potential vegetation and simulations of agricultural land. In this paper, we discuss the data sources and methods used and present the resulting dataset including its limitations and uncertainties. The dataset is available from http://doi.org/10.5880/PIK.2016.003 and can be viewed on the website accompanying this paper (www.pik-potsdam.de/primap-live/primap-hist/).


As this dataset is designed to be used for global climate policy analysis, we provide data for all 196 member states of the UNFCCC as well as several countries and territories that are not UNFCCC members, not internationally recognized, or associated with a UNFCCC member state but not included in the emissions reporting of that state. We follow the territorial coverage of the countries' submissions to the UNFCCC and use territorial accounting, which is in line with UNFCCC standards.
Territorial accounting attributes emissions originating from a certain territory at any point in time to the state the territory currently belongs to. Emissions of former colonies are thus attributed to the now independent state and not to the former metropolitan state. Occupation of countries' territories is only taken into account if the occupying country reports the emissions from that territory. In Section 4.3 we present a list of territories included in the emissions of UNFCCC Parties as well as information on the territories that are treated separately and how we deal with missing data and territorial changes.

The paper is organized as follows: we begin by describing the individual data sources we use in Section 2 and their prioritization in Section 3. In Section 4 we describe how the dataset is constructed from the individual sources including the special treatment of land use data. In Section 5 we give information on how to obtain and use the data. Results are described in Section 6 with information on the uncertainties of emissions data in Section 7. Limitations are covered in Section 8. Methodological details, sector coverage, territorial definitions, and data sources that we did not use are described in the Appendix.

2 Data sources

In this section we describe the data sources used to create our composite source. We only use sources that are publicly available and prefer sources that are not composites of other sources to avoid including original sources twice, once directly and once indirectly through a composite source. However, it is likely that some sources share at least some input data such as information on fossil fuel production or use the same emission factors. The sources are grouped into four categories. Country reported data is the highest priority category as it can benefit from detailed knowledge about the specific situation in a country and is well accepted in the context of the UNFCCC negotiations. This is exemplified by the linking of the entry into force of the Paris Agreement to the latest country reported emissions and not to any third party source (UNFCCC (2015b)). Where this data is not available or does not meet our minimum requirements (see Section 2.1 below) we use country resolved data provided by third parties like research institutions and international organizations. To extrapolate data into the past we use region resolved datasets. Finally, we use some gridded datasets and calculate country resolved data using country masks. Figure 1 gives an overview of the data sources described in detail in the remainder of this section. Detailed information on data preprocessing is available in Appendix B. In the text we refer to data sources using the acronyms introduced in the source description below.

Minimum requirements for data
To be useful for our composite source, data has to meet some minimal requirements. Emissions data has inherent fluctuations due to weather (determining heating requirements), economic activity, and other factors. Not all sources model all these factors equally and therefore exhibit different fluctuations. When combining the sources, we use the year by year growth rates from the lower priority source to extend a higher priority source (for details see Section 4.1 and Appendices A4 and A5). To weaken the influence of these fluctuations we use the trend of several years for the matching instead of a single year. We therefore require that each time series contains at least three data points spread over a period of at least 11 years. Furthermore, we need time series with the detail of sectors and gases listed in Table 1.
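The requirement on data points and time span can be written as a small check. The following is a minimal sketch, not the code used for the dataset: the function name, the dictionary representation of a time series, and the inclusive counting of the period (first and last year both count) are assumptions made for illustration.

```python
def meets_minimum_requirements(series, min_points=3, min_span_years=11):
    """Check a time series against the minimal data requirements.

    series: dict mapping year (int) -> value; None marks a missing value.
    """
    # Keep only years that actually carry data.
    years = sorted(year for year, value in series.items() if value is not None)
    if len(years) < min_points:
        return False
    # Assumption: the period is counted inclusively, so 1990..2000 spans 11 years.
    span = years[-1] - years[0] + 1
    return span >= min_span_years
```

Under these assumptions a series with values for 1990, 1994, and 2000 would pass, while a series covering only 1990 to 1994 would be rejected.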

Country reported data
Under the UNFCCC there are several requirements for reporting of greenhouse gas emissions data (see e.g. Yamin and Depledge (2005)). Both developed (Annex I) and developing (non-Annex I) Parties have to regularly submit communications that include an inventory of national GHG emissions and removals. Detailed requirements, however, differ strongly between Annex I and non-Annex I Parties. Annex I Parties have to submit annually an inventory that covers all sectors, gases, and years since 1990. The submissions should consist of two parts, the common reporting format (CRF) tables with the data and a national inventory report (NIR), which gives background information like the rationale behind the selection of emission factors and methodological questions. For details on the CRF tables see Section 2.2.3 below. Annex I Parties also submit national communications, which originally served the purpose to report on policies and measures to implement the Party's commitment to aim to return emissions to 1990 levels by the year 2000. The NIRs have recently (decided in 2011 at COP17, Durban) been complemented with Biennial Reports to enhance reporting. The emissions data contained should be consistent with the CRF data. Under the Kyoto Protocol (KP) Annex I Parties have to regularly submit information needed to assess whether they are meeting their emission targets. For our purpose the CRF data is the most useful of these sources.
The other sources do not provide additional information for the purpose of this paper and are not used.
Non-Annex I Parties were required to submit an initial national communication within three years after the entry into force of the convention. The least developed countries (LDCs) could decide if they submitted an initial national communication.
The submissions were required to contain an emissions inventory, which covers the years 1990 to 1994 for most submissions.

A time frame for subsequent national communications could not be agreed upon, and only few countries submitted further national communications with updated inventories. The guidelines for national communications for non-Annex I Parties are less stringent than the guidelines for Annex I Parties; consequently, the coverage and detail in sectors and gases of the data differ strongly between countries. Since 2014 non-Annex I Parties are required to report GHG inventory information through Biennial Update Reports (BUR). The first report was due by December 2014, however only 24 of over 150 countries actually [...] The Paris Agreement requires regular national inventory reports by all Parties, which might improve emissions reporting in the future (UNFCCC (2015b), Article 7 (a)).

National Communications and National Inventory Reports for developing countries [UNFCCC2015]
Most developing countries reported historical emissions data at least once using National Communications (UNFCCC (2015a)) and sometimes National Inventory Reports. However, several countries only reported data for the period of 1990 to 1994, sometimes only single years. Therefore, a lot of countries' submissions do not meet our minimal data requirements and are consequently not used for the composite source. Where the data meets our requirements we use it with high priority as it is prepared by in-country experts, which gives the results based on it high credibility within the country, which is beneficial for policy analysis. We compare it with third party data to identify if there are differences that cannot be explained by uncertainties. The source preprocessing is explained in Appendix B.
RCP historical data are compiled from a wide range of emission sources and atmospheric concentration measurements.

SAGE Global Potential Vegetation Dataset [SAGE]
This dataset is available in the SAGE (Center for Sustainability and the Global Environment) database and is described in Ramankutty and Foley (1999) and available for download from Ramankutty and Foley (2015). It contains 5' resolution grid maps of potential vegetation (i.e. vegetation that potentially could be in a certain spot if there was no human interference) for a time period from 1700 to 1992. It has been used together with HYDE 3.1 in Matthews et al. (2014) to downscale CDIAC land use CO2 emissions to country level. We use it for the same purpose here.

Source prioritization
To create a dataset covering all countries and gases for a period of over 150 years, multiple data sources need to be combined as no single source contains all the necessary data. We order sources such that the highest quality sources are selected for each gas, category, and year according to availability. Where possible, source prioritization is defined, and used, at a global level.
3.1 Emissions from energy, industrial processes, solvent use, agriculture, and waste

For fossil emissions our highest priority source is the UNFCCC CRF data as it is both accepted by the countries that report and by other countries as it is expert reviewed. However, it is only available for developed country Parties. We use CRF2014 and fill gaps with CRF2013 where necessary. For non-Annex I Parties we use data from National Communications and National Inventory Reports with highest priority (UNFCCC2015). For a few developing countries, data from the Biennial Update Reports (BUR2015) is available and fulfills our minimal requirements. It is used to supplement the UNFCCC2015 data. UNFCCC2015 is prioritized over BUR2015 because the latter only contains a few data points for most countries while the UNFCCC2015 data contains full time series for more countries. Those sources of UNFCCC reported data cover a wide range of gases and sectors (for most countries CO2, CH4, and N2O for all sectors at the level of detail needed for the composite source; fluorinated gases are only contained for a few countries). For fossil fuel burning related CO2, CO2 from flaring, and CO2 emissions from mineral products we use CDIAC as the next source. For CO2 from other sectors and all other gases we use a combination of EDGAR v4.2 FT2010 and EDGAR v4.2 as the next source. It is also used to complement CDIAC data where necessary (e.g. for small countries missing in the CDIAC source). BP2015 data is used to extend the energy CO2 time series until 2014. Where no country reported data is available the country resolved data sources are used as the first sources.

Sources without country level information, RCP and EDGAR-HYDE, are used to extrapolate emissions into the past. As EDGAR-HYDE has a higher regional and sectoral resolution it is used as the first priority source for extrapolation of CO2, CH4, and N2O emissions. Emissions from fluorinated gases for years before 1970 are only available from the RCP historical data and only on a global level.
The source prioritization for the individual gases is summarized in Tables 2, 3, 4, and 5. Details of the source creation methods are available in Section 4.1.
3.2 Land use, land use change, and forestry emissions

The first priority source for land use CO2 is FAOSTAT. However, it does not contain information for the period before 1990. EDGAR42 does contain information starting in 1970 but excludes sinks from the calculation of CO2 land use emissions, which is why we exclude EDGAR CO2 land use data from our dataset. The period before 1990 is covered by the Houghton dataset on a regional level, which we downscale using estimates of historical deforestation (see Section 4.2).
For CH4 and N2O we use country reported data, FAOSTAT, and EDGAR data on a per country basis. Regional growth rates from EDGAR-HYDE14 are used to extrapolate the time series.
4 Dataset construction

4.1 Emissions from energy, industrial processes, solvent use, agriculture, and waste

The generation of the emissions time series is carried out using the Composite Source Generator (CSG) of the PRIMAP emissions module described in Nabel et al. (2011). Data is aggregated on a per country, per gas, and per category level taking into account source prioritization (see Section 3). The result is one time series for each country, category, and gas. The source creation is organized in four steps described below.
Composite Source Generator

The composite source generator (CSG) works on every country, gas, and category individually. Its core is the priority algorithm, which combines the sources following a given prioritization. The algorithm starts with the highest priority source. Missing time series are copied from lower priority sources. After this step the priority algorithm fills gaps in the time series using lower priority sources and extrapolates using year to year growth rates from lower priority sources. For each gas, category, country time series it is checked if the composite source contains gaps or does not cover the full time period. If that is the case the second highest priority source is checked for data that could fill gaps and extend the time series. If that time series itself contains gaps or needs extension, the hierarchy is parsed downwards recursively and the resulting time series is used to extend the composite source. For details on the harmonization see Appendix A4. For this study we add one source at a time and therefore do not parse the sources recursively but add what is present in the next source and then see if the resulting time series needs further extension.
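The extension step of the priority algorithm can be illustrated with a short sketch. This is not the CSG implementation from Nabel et al. (2011): the function names and the dictionary representation are hypothetical, the matching uses a single border year instead of a multi-year trend, interior gaps are not treated, and the lower priority value at the border year is assumed to be non-zero.

```python
def extend_with_growth_rates(high, low):
    """Extend `high` into years only covered by `low`.

    Both arguments are dicts mapping year -> emissions. The lower priority
    series is scaled so that it matches the higher priority series at the
    border year, which preserves its year to year growth rates.
    """
    overlap = sorted(set(high) & set(low))
    if not overlap:
        # No common year to anchor on: copy the missing years unchanged.
        return {**low, **high}
    result = dict(high)
    first, last = overlap[0], overlap[-1]
    factor_past = high[first] / low[first]    # scaling for backward extension
    factor_future = high[last] / low[last]    # scaling for forward extension
    for year, value in low.items():
        if year < min(high):
            result[year] = value * factor_past
        elif year > max(high):
            result[year] = value * factor_future
    return result


def build_composite(sources_by_priority):
    """Add one source at a time, highest priority first."""
    composite = {}
    for source in sources_by_priority:
        if composite:
            composite = extend_with_growth_rates(composite, source)
        else:
            composite = dict(source)
    return composite
```

With a hypothetical high priority series `{1990: 100.0, 1991: 110.0}` and a lower priority series that is half as large in 1990, the lower priority values before 1990 are doubled before being appended, so the composite has no jump at the border.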

Inter- and extrapolate over time (priority algorithm): complete the composite source time series by using growth rates from lower priority data.
Extrapolation: extrapolate time series using regional data or numerical methods.
Sudan needs a special treatment as the split into Sudan and South Sudan has been so recent that no separate emissions data is available yet. We downscale the Sudan emissions time series to Sudan and South Sudan using UN population statistics (UN Population Division (2015)) as a downscaling key. We also aggregate country data for some regional groups.
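Downscaling with a key, as done for Sudan and South Sudan, can be sketched as follows. The function name and data layout are assumptions for illustration, and any numbers passed in would be placeholders rather than the UN population statistics used for the dataset.

```python
def downscale_with_key(parent_series, key_by_unit):
    """Split a combined time series among units in proportion to a key.

    parent_series: dict year -> emissions of the combined territory.
    key_by_unit: dict unit name -> dict year -> key value (e.g. population).
    Returns dict unit name -> dict year -> downscaled emissions.
    """
    result = {unit: {} for unit in key_by_unit}
    for year, total in parent_series.items():
        # Sum of the key over all units for this year.
        key_sum = sum(key[year] for key in key_by_unit.values())
        for unit, key in key_by_unit.items():
            result[unit][year] = total * key[year] / key_sum
    return result
```

By construction the downscaled series add up to the original combined series in every year.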
Figure 3 shows an example of how we build a pathway from different time series.
In the following we describe the data availability and use in detail for the different gases and sectors. To further extend time series into the past we use EDGAR-HYDE regional growth rates (starting in 1890). For categories 1A, 1B1, and 1B2 explicit time series are available while we use category 2 time series as a proxy for the subcategories of category 2. Other categories are not available. RCP CO2 data that ranges back until 1850 is only available for total emissions excluding LULUCF on a global level. As total CO2 emissions are dominated by fossil fuel burning we use the RCP data as growth rates to extrapolate category 1A emissions for those countries that were not covered by CDIAC and EDGAR-HYDE from 1850 onwards. This does not affect any major emitter at the time for which data is extrapolated. For categories 3, 4, 6, and 7 no source for extrapolation is available so the first year is 1970 from EDGAR.
We use growth rates of the fossil fuel consumption time series for each country as a proxy to extend the time series of all other sectors to 1850.
Table 3. Source prioritization for fossil and industrial CH4. Years are maximal values. Some countries have less coverage. In CRF a few countries have data starting a few years before 1990. Category names refer to IPCC 1996 categories. Note that there are no CH4 emissions data in category 3 (Solvent and Other Product Use).

a linear decline to zero in 1850 from the last year with data starting from a 21 year linear trend. In category 3 there are no CH4 emissions reported. The source prioritization and extrapolation is summarized in Table 3.

Table 4. Source prioritization for fossil and industrial N2O. Years are maximal values. Some countries have less coverage. In CRF a few countries have data starting a few years before 1990. Category names refer to IPCC 1996 categories.
to 1970 we use the regional growth rates from the EDGAR-HYDE dataset to extrapolate categories 1, 2, 4, and 6. For the period prior to 1890, the RCP database provides data, but only at a global level and without sectoral detail. We know of no source that provides regionally or sectorally resolved N2O emissions prior to 1890. The main contribution to N2O emissions comes from the agricultural sector, especially the use of manure and nitrogen fertilizers (Davidson (2009)). N2O emissions are therefore not well correlated with CO2 or CH4 emissions as these have different sources and thus they cannot be used as a proxy for N2O emissions. Data on fertilizer use is only available for a few countries for years earlier than 1961 (Federico (2008)). This is not sufficient for downscaling of agricultural N2O emissions. We therefore use the RCP global growth rates, which are computed from atmospheric concentration measurements, to extend the country time series into the past for all sectors. The source prioritization and extrapolation is summarized in Table 4.

Fluorinated gases

Country reported data covers 1990 to 2012 for all Annex I countries and some non-Annex I countries.

Other countries are added from EDGAR42, which also extends existing time series to start in 1970. To extrapolate the data to 1850 we use RCP global growth rates. RCP data and global emissions from EDGAR data are in very good agreement for the time of overlap of the two sources for SF6, HFCs, and PFCs. The time series are obtained using different methods: EDGAR from activity data and emission factors and RCP from inverse emission estimates based on atmospheric concentration measurements. This is a good sign with respect to the uncertainty in the datasets. Because of the similarity in absolute emissions, using RCP growth rates to extend EDGAR data does not significantly alter the global emissions compared to the RCP and is a safe method to obtain emissions back until 1850 for the first years of use of fluorinated gases. Emissions from fluorinated gases are generally very low before 1950 as their large-scale production and use only started in the second half of the 20th century. Technology for large-scale production of HFCs was developed in the late 1940s. For PFCs a major breakthrough in industrial production was the Fowler process, which was published in 1947, and industrial production of SF6 began in 1953 (Levin et al. (2010)). The IPCC "Special Report on Safeguarding the Ozone Layer and the Global Climate System" (Metz et al. (2007)) estimated emissions [...]

Table 5. Source prioritization for fluorinated gases. Years are maximal values. Some countries have less coverage. In CRF a few countries have data starting a few years before 1990. Category names refer to IPCC 1996 categories. Fluorinated gas emissions are only reported in category 2. For some countries, data in the BUR and UNFCCC sources is only available for SF6.

Emissions from land use
The largest share of emissions from land use, land use change, and forestry (LULUCF) is in the form of CO2 originating from deforestation. We therefore focus on CO2 emissions and use a simpler method for CH4 and N2O emissions. The preparation of the LULUCF pathways follows the same steps as for the fossil fuel and industry pathways. However, due to the high fluctuations in LULUCF data the harmonization of sources is problematic (e.g. when one source shows a sink while another source shows emissions for the same period of time). We therefore use the time series from different datasets directly without harmonization. In the preprocessing the Houghton source needs downscaling, which is described below.

Downscaling of HOUGHTON2008
The Houghton source only resolves 10 regions: Canada, China, Europe, Former USSR, Northern Africa and Middle East, Pacific Developed Countries, South and Central America, South and Southeast Asia, Tropical Africa, and the USA. Data for all countries except Canada, China, and the USA has to be computed using downscaling of regional emissions. Estimates of historical deforestation can be computed starting from models of the amount of cropland and pasture required to feed the population in a certain area at a certain time. This time series gives estimates of the land converted to cropland or pasture in that area. Using a dataset of potential natural vegetation (i.e. simulated vegetation in the absence of human interference like deforestation) we compute which fraction of that land was likely covered by forests before the conversion.
This gives us a time series of deforested areas on a grid map of the world. The gridded data is transferred into country data using country masks.
The cropland and pasture data is taken from the History Database of the Global Environment (HYDE [...] dataset also includes a PFT for savanna, which we included in the "non-forest" category. Although loss of biomass from savanna land has contributed to historical CO2 emissions, we chose to exclude it from this dataset because the carbon density is substantially different to that of forest or woodland areas occurring in the same region. The CO2 emissions downscaling scheme assumes uniform carbon density of vegetation throughout each region, so savanna was excluded to avoid skewing results. While the different forest PFTs also have different carbon contents, the variability within a region is much smaller than the difference between forest PFTs and savanna within one region. See e.g. Figure 1 of Liu et al. (2015).
The area converted to agricultural land, the sum of cropland and pasture, and that coincides with land that would otherwise be forested is calculated to determine the areal extent of deforestation, and reforestation, over 10 year time steps for each grid cell. Spatial data is converted to country time series using an area-weighted summation according to the country boundaries data of the Food and Agriculture Organization of the United Nations (2015c). See also Figure 4.
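The conversion from gridded data to country totals can be sketched for one time step. This is an illustrative simplification and not the processing chain used for the dataset: all names are hypothetical, each grid cell is assigned to exactly one country by the mask (no fractional border cells), and the forest share of a cell is taken as a fixed fraction derived from potential vegetation.

```python
def deforested_area_by_country(agri_before, agri_after, forest_fraction,
                               country_mask, cell_area):
    """Sum the deforested area per country for one time step.

    All arguments are 2-D lists with the same shape:
      agri_before / agri_after: fraction of the cell used as cropland plus pasture
      forest_fraction: fraction of the cell that would potentially be forest
      country_mask: country code per cell, or None for cells without a country
      cell_area: area of each cell (e.g. in km^2)
    Positive results mean deforestation, negative results reforestation.
    """
    totals = {}
    for i, mask_row in enumerate(country_mask):
        for j, country in enumerate(mask_row):
            if country is None:
                continue
            # Change in agricultural land within the cell.
            delta = agri_after[i][j] - agri_before[i][j]
            # Only the potentially forested part counts, weighted by cell area.
            area = delta * forest_fraction[i][j] * cell_area[i][j]
            totals[country] = totals.get(country, 0.0) + area
    return totals
```

Repeating the calculation for each 10 year step would give a country time series of deforested area.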

To downscale the regional emissions data we make the assumption that forests in a region have the same average carbon content. So for any two countries in a region, we assume that converting one hectare of forest into cropland in one country releases the same amount of CO2 to the atmosphere as converting one hectare of forest in the other country. The time resolved data exhibits strong fluctuations, which do not necessarily coincide with fluctuations in the emissions data. One reason for this is the different methodological approaches used to create the two datasets. While the Houghton dataset models actual emissions from deforestation in detail, the method to calculate deforested area uses datasets that are of a more theoretical nature. The HYDE dataset models the need for agricultural area in a region and does not represent the agricultural area that was actually present at that time. When population changes, the need for agricultural area changes with it, but the actual agricultural area changes more slowly. This is especially visible in Europe during the Second World War. Population and thus need for agricultural area declined rapidly, leading to afforestation in the model. In reality, agricultural area will remain unused for [...] the whole period of 1850 to the last data year in the Houghton source to downscale the regional emissions to country level.
This approach is also taken in Matthews et al. (2014). Details are given in Appendix B.
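Under the assumption of uniform carbon content within a region, the downscaling can be written as a simple proportionality. The following is a sketch with notation introduced here for illustration only; in particular, the reading that the country shares are computed from deforested area accumulated over the whole period is an assumption, as the description above is incomplete.

```latex
% E_r(t): regional emissions in year t (Houghton source)
% A_c:    deforested area of country c, accumulated over the whole period
% E_c(t): downscaled emissions of country c in region r
E_c(t) = E_r(t) \, \frac{A_c}{\sum_{c' \in r} A_{c'}}
```

Summing over all countries of a region recovers the regional emissions, so the downscaling does not change regional totals.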
For some small countries and countries that recently became independent, no emissions data is available yet. In this case [...] (2015)).

Data availability
The dataset is available from the GFZ Data Services under the url http://doi.org/10.5880/PIK.2016.003 (Gütschow et al. (2016)). When using this dataset or one of its updates, please cite this paper and the precise version of the dataset used.
Please consider also citing the relevant original sources when using this dataset. Any use of this dataset should also comply with the usage restrictions of the original data sources used for this project.

In this section we show some key results of our analysis. Details for additional countries, sectors and gases can be explored on-line on our companion website www.pik-potsdam.de/primap-live/primap-hist/. Here we focus on major emitters and global emissions.
Before the industrial revolution, deforestation was the major emissions source followed by agriculture. Currently, these sectors are the second and third largest sources. Roughly 10% of emissions come from waste and industrial processes. Industrial processes increased their share in emissions after 1950 while the share of waste related emissions stayed relatively constant.
[16] GDP data is not available.
The sectoral profile differs strongly among countries ( [...] Waste gives a small contribution, differing by country without a clear split between developed and developing countries. The contribution of industrial processes is larger in industrialized countries, but especially large in China.

6.2 Gas distribution of economy wide emissions for major emitters

The contribution of individual gases and gas groups to (GWP weighted) economy wide (IPCC 1996 category 0) emissions is shown in Figure 6. It is clearly visible that CO2 is by far the largest contributor followed by CH4 and N2O, both globally and for individual countries. The contribution of fluorinated gases is in general small and negligible for developing countries.
Again, China's emissions profile is closer to that of an industrialized country than to other major developing country emitters.

Uncertainties
In this paper we do not assess the uncertainties of the dataset in detail. Of the individual datasets used, uncertainty information is available for some while for others it is not provided. Where it is available, the level of detail is very different. Some datasets give per country or per regional group uncertainty estimates while others only provide global estimates. Individual uncertainty estimates can be over 100% (Olivier et al. (1999)). To calculate uncertainty estimates for all countries, gases and sectors for the composite source one has to transform the information given for the individual sources to a common methodology and level of detail and combine it in line with the creation of the composite source. As most datasets come without an uncertainty estimate and third party estimates are scarce for some datasets it is hard to find a consistent set of uncertainty estimates. Furthermore, different studies use different sectoral resolutions, confidence intervals etc., which makes it difficult to compare and combine the results to arrive at an estimate for our aggregate source. We leave this task for a future publication. In the following, we give a broad overview of the uncertainties of individual sources and compare this dataset to other sources to get an estimate of differences and uncertainties of the source.

[Figure: time series plot (time axis 1860 to 1980 visible). The figure is discussed in Section 6.1.]
7.1 Uncertainties from individual sources

Uncertainty estimates for the CDIAC dataset of global CO2 emissions from fossil fuels and industry have varied since the first assessment made by Marland and Rotty (1984), which resulted in an uncertainty range between 6 and 10% (using a 90% confidence interval). In a recent publication a single global fossil fuel CO2 emission uncertainty of 8.4% (using a 95% confidence interval) is offered as a reasonable combination of data (Andres et al. (2014)), in an attempt to simplify the different assessments and to make the best of the qualitative and quantitative knowledge developed since the first study of 1984.
The EDGAR-HYDE data shows relatively low total Kyoto GHG values. The sector plots show that this is due to low [...] for some sectors missing in EDGAR-HYDE14, namely grassland and forest fire emissions [17]. However, the discrepancies cannot fully be explained by this as they are also present in sectors other than land use.

For N2O, MATCH and EDGAR42 aggregate Kyoto gas emissions are lower than the PRIMAP-hist dataset while EDGAR-HYDE14 and RCP are higher. MATCH uses EDGAR-HYDE (growth rates) prior to 1990, which explains the very similar pathway profiles and leads to very low emissions outside the uncertainty range before 1970. Finally, the estimates of emissions of fluorinated gases are higher for EDGAR42 than for our aggregate dataset. This means that country reported fluorinated gas emissions are significantly lower than what EDGAR calculates the emissions to be.
[17] International shipping and aviation emissions are also added, but not included in this study.
We begin with key sources for uncertainties and differences between datasets.
-Different methodologies for estimating emissions: some datasets are based on end of pipe measurements :::::::: measures, others on economic activity data and assumed emission factors. Global emission datasets can also be based on inverse emission for the electricity production of individual power plants differs. Similarly the data on the exact fuel type used and the emission factors used influence the resulting emissions.
-Differences in the detailed definitions of sectors: there are different ways to categorize emissions by economic sectors and not all data sources use the same categories. Categories from different sources can differ in their exact content despite 5 having broadly the same definition.
-Different assumptions made for variables without data: the uncertainties are especially high for countries without a [...]
An overview of the relative uncertainties for the different sources, countries, gases, and sectors is presented in Section 7.
To create a composite dataset we first prioritize the different data sources according to our judgment of their reliability and completeness. More complete sources at the top levels in the hierarchy will create a more consistent dataset than sources which cover only a few sectors or gases. However, if the top-level sources are unreliable, the resulting dataset will be unreliable and it is beneficial to prioritize more reliable but less complete sources. Completeness has different dimensions, which we often cannot optimize at the same time. Some datasets are very extensive in time and country coverage, but only cover a few gases and sectors (e.g. CDIAC), while other sources cover only a fraction of the countries and years but with almost perfect sectoral and gas resolution (e.g. CRF, UNFCCC, BUR).

The first priority source is used as an anchor point for the other sources, which are used to extend the time series and to fill gaps. There are different options for the harmonization needed when extending one source with data from another source. We present some options below; a more detailed discussion is available in Rogelj et al. (2011):
1. no scaling: this does not alter data, but also does not use information from the first priority source to improve data from the lower priority sources.

2. full scaling: here we scale the lower priority sources such that they match the higher priority sources at the borders.
Effectively we are using the growth rates of the lower priority sources to extend the higher priority source. If e.g. an emission factor is different for the two sources leading to a large difference in absolute emissions, the growth rates would still be the same and the extension with scaling would effectively use the emission factor of the first source also for the second source. Of course not all differences come from multiplicative errors like different emission factors. There [...]
3. shifting using an offset: the lower priority time series is harmonized by shifting the complete time series by a constant. This method implicitly assumes a constant error over time, which is not realistic if the emission time series is not constant. For extrapolation to the past it will likely overestimate emissions while it will likely underestimate emissions for extrapolations to the future (assuming rising emissions).
We use a mixture of approaches 1 and 2. We use scaling but limit the scaling at a factor of 1.5 to avoid introducing additional errors in case of extremely different emissions data.
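The capped scaling can be illustrated with a minimal sketch. It is not the PRIMAP implementation: the function name, the dictionary representation of time series, the restriction to extension to the past, and the symmetric clipping of the factor to the interval [1/1.5, 1.5] are all assumptions made for illustration.

```python
def extend_with_capped_scaling(high, low, max_factor=1.5):
    """Extend `high` to the past with `low`, both dicts mapping year -> emissions.

    Assumption (not from the source): the scaling factor is the ratio of the
    two sources in the first year of `high`, clipped to
    [1 / max_factor, max_factor].
    """
    match_year = min(high)
    if low.get(match_year, 0) == 0:
        factor = 1.0  # no usable overlap: fall back to "no scaling"
    else:
        raw = high[match_year] / low[match_year]
        factor = max(1.0 / max_factor, min(max_factor, raw))
    combined = {year: value * factor for year, value in low.items() if year < match_year}
    combined.update(high)  # the higher priority source is never altered
    return dict(sorted(combined.items())), factor


# Hypothetical numbers: the sources differ by a factor of 3 in 1990,
# so the cap of 1.5 applies and a jump remains at the border.
series, factor = extend_with_capped_scaling(
    high={1990: 300.0, 1991: 310.0},
    low={1988: 90.0, 1989: 95.0, 1990: 100.0},
)
```

With a factor inside the allowed interval the result is identical to full scaling (approach 2); the further the raw factor lies outside the interval, the more of the original offset between the sources is retained.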
When combining the different sources we cannot take into account all their methodological differences. Often the exact assumptions and underlying data are not published with the datasets and an assessment of the uncertainty of the individual datasets is difficult as useful analysis is scarce (see also previous section). Thus, sometimes a time series using a slightly different sector definition is used to extend another time series. This introduces inconsistencies into the final dataset.

In Section 7 we presented uncertainties of the individual sources, sectors, and gases, which can reach over 100% for some gases and sectors. We have to keep that in mind when designing and judging our methods. A very fine tuned and subsector resolved method for the combination of datasets is still bound to the limitations of the input data and their uncertainties. While it is always possible to improve methods to reduce their uncertainty, it is not always sensible to invest more time if the major source of uncertainty is the input data and not the processing. Before adding further detail to the PRIMAP-hist dataset it has to be assessed if it adds real value to the data or is overshadowed by uncertainties of the input data.
When using emissions data one has to respect the uncertainties and limitations of the data. When making a statement about emission intensities in different countries, the differences have to be seen in relation to the uncertainties before deducing anything from the calculated values. Individual country uncertainties can be much higher than the global uncertainties presented in Table 10. One of the purposes of this dataset is the calculation of countries' contributions to climate change. Again we have to keep uncertainties in mind. This dataset can be used to study general effects, such as the impact of pre-1950 emissions on 2100 warming, rather than the exact emission targets for all countries according to a given equity principle (unless one accepts and communicates the uncertainties of the resulting emission targets).
The land use downscaling methodology could be improved by a more detailed treatment of the different plant function types and the inclusion of savannas. Furthermore, the HYDE data does not account for deforestation for firewood, which influences the estimates of deforested areas. The SAGE potential vegetation dataset also removes the human influence on the climate from the simulation. Climate is influenced globally and thus some of the discrepancy between potential and actual vegetation is caused by global climate change and not by local deforestation. 18
Finally, we have to note that the last years are obtained using extrapolations for most sectors and gases. Therefore these data cannot be used to make statements about short term emissions trends. We provide a version of this dataset that does not use numerical extrapolation to the future that can be used for this purpose. Where regional data is used for extrapolation to the past, individual country developments are not taken into account and cannot be deduced from the data. Short term trends can also be influenced by the combination of different sources, thus the consultation of original sources is advised before making statements about such trends.

A1 Preprocessing
We use the same methods of preprocessing for all sources, though not all steps are used for all sources. Source specific information is provided in Appendix B.

A1.1 Zero data and implausible data
We remove all time series that contain only zero values to ensure that zero values in higher priority sources do not prevent the use of non-zero data from lower priority sources. In case negative data occurs in time series that physically have to be positive we replace the negative data by zero.
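As a minimal sketch of this preprocessing step, assuming time series are stored as dictionaries keyed by a hypothetical (country, gas, category) tuple and that the set of series which must be non-negative is supplied by the caller:

```python
def remove_zero_and_negative(series_by_key, non_negative_keys):
    """Drop all-zero time series and clip negative values where they are unphysical.

    `series_by_key` maps a hypothetical (country, gas, category) key to a
    dict year -> value; `non_negative_keys` lists keys that must be >= 0.
    """
    cleaned = {}
    for key, series in series_by_key.items():
        if all(value == 0 for value in series.values()):
            continue  # only zeros (or empty): let lower priority sources fill in
        if key in non_negative_keys:
            series = {year: max(value, 0.0) for year, value in series.items()}
        cleaned[key] = dict(series)
    return cleaned
```

Series that may legitimately be negative, such as land use removals, are simply left out of `non_negative_keys` and pass through unchanged.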

A1.2 Gas and category aggregation
Where necessary we aggregate gases to gas baskets (e.g. individual HFCs to the HFCs basket). If data is available at a more detailed sectoral level, we aggregate the categories to obtain time series at the sectoral resolution needed for the PRIMAP-hist dataset. In the process of aggregation we fill gaps in individual time series and extrapolate individual time series such that all gases / subsectors cover the same time period. Details of the extrapolation methods are discussed in Appendix A5.2 below. The same aggregation routine is also used in postprocessing to aggregate higher categories and the KyotoGHG basket.

A2 Accounting for territorial changes
Where necessary countries are summed or split to match our territorial definitions. Where only aggregate information is available we use downscaling to obtain country level information. In case we have to downscale emissions of formerly existent larger countries to the current individual countries, we downscale the larger countries' emissions using constant shares defined by the average of the first five years with data for the individual countries. This is used e.g. for countries of the former USSR. If no data for individual countries is available, we use an external downscaling key, e.g. emissions from a different source or GDP. When countries merge we sum the individual countries. This is used e.g. for Germany.
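The constant-share split can be sketched as follows. The function names and data layout are hypothetical, and the sketch assumes that emissions are first averaged over the first five years in which all successor countries have data and that the shares are then taken from these averages; the text does not specify the order of averaging.

```python
def constant_shares(country_series, n_years=5):
    """Shares of each successor country in their sum, from the first `n_years`
    years in which all countries have data.

    Assumption: emissions are averaged over these years first and the shares
    are taken from the averages.
    """
    common = set.intersection(*(set(series) for series in country_series.values()))
    years = sorted(common)[:n_years]
    means = {
        country: sum(series[year] for year in years) / len(years)
        for country, series in country_series.items()
    }
    total = sum(means.values())
    return {country: mean / total for country, mean in means.items()}


def split_aggregate(aggregate, shares):
    """Apply constant shares to the time series of the former larger country."""
    return {
        country: {year: value * share for year, value in aggregate.items()}
        for country, share in shares.items()
    }
```

Because the shares sum to one, the split time series add up to the original aggregate in every year.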

A3 Downscaling
We downscale regional data using country shares calculated from a different source, the key. Before downscaling, the key is preprocessed such that time series for all countries present cover the whole period to be downscaled. Extrapolation of country pathways is done using the growth rates of all countries present in the region. This implies that the shares in regional emissions of countries with missing data stay constant from the last year with data (both for extrapolation to the future and to the past). If no data is present for any country in a region for a certain year it is extrapolated using constant emissions implying constant shares for the downscaling. Once the key time series is complete the downscaling itself is done by multiplication of the country shares with the regional data.
[...] emission time series due to scaling.
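The downscaling with a key can be sketched in two steps: completing the key and applying the shares. This is an illustrative simplification, not the PRIMAP code. The function names are hypothetical, only gaps at the end of a key time series are filled (extension to the past would work analogously in reverse), and the growth rate is taken from the sum of all countries that have data in both years.

```python
def extend_key_forward(key, years):
    """Fill trailing gaps in the key with the growth of the countries that have data.

    `key` maps country -> dict year -> value; `years` is the sorted list of
    years to be covered. Countries must have data in the first year.
    """
    key = {country: dict(series) for country, series in key.items()}
    for prev, year in zip(years, years[1:]):
        present = [c for c in key if prev in key[c] and year in key[c]]
        missing = [c for c in key if prev in key[c] and year not in key[c]]
        base = sum(key[c][prev] for c in present)
        # Growth of the countries with data; constant values if there are none.
        growth = sum(key[c][year] for c in present) / base if base else 1.0
        for c in missing:
            key[c][year] = key[c][prev] * growth
    return key


def downscale(regional, key):
    """Multiply the regional series with each country's share in the key."""
    result = {country: {} for country in key}
    for year, regional_value in regional.items():
        key_total = sum(series[year] for series in key.values())
        for country, series in key.items():
            result[country][year] = regional_value * series[year] / key_total
    return result
```

Since an extended country grows at the same rate as the sum of the countries with data, its share in the regional total stays constant from its last year with data, as described above.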
In case of land use emissions we do not use scaling, but fill gaps with unchanged data from lower priority sources. The high fluctuations of land use data including different signs for data from different sources for the same year introduce high uncertainty in the scaling and render it meaningless in some cases, e.g. when one dataset shows removals while the other shows emissions for the period of overlap.

A5.1 Extrapolation with regional growth rates
For each region in the extrapolation source we loop over all countries contained in the region. We identify if there are years within the given span where the extrapolation source contains data that could extend the country data. If this is the case, we compute the value for the last year without data for the country (the matching year) given by a linear trend. We compute the trend from opposite sides, i.e. for extrapolation to the past from 1850 to 1890 we compute the 1890 value of the country data from a linear trend through 1891 to 1905 and the 1890 value for the regional data from a linear trend through 1876 to 1890. The regional time series is then scaled such that they are identical in the matching year and we extend the country data with the resulting time series. Unless stated otherwise we use 15 year trends.
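A minimal sketch of the extension to the past, assuming gap-free yearly data stored as dictionaries and an ordinary least-squares line as the linear trend. The function names are hypothetical, and scaling the raw regional values by the ratio of the two trend values in the matching year is one possible reading of the procedure.

```python
def linear_trend_value(series, years, at_year):
    """Least-squares line through (year, value) for `years`, evaluated at `at_year`."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(series[y] for y in years) / n
    sxx = sum((y - mean_x) ** 2 for y in years)
    sxy = sum((y - mean_x) * (series[y] - mean_y) for y in years)
    return mean_y + sxy / sxx * (at_year - mean_x)


def extend_to_past(country, region, trend_years=15):
    """Extend `country` (dict year -> value) to the past with scaled `region` data."""
    first = min(country)      # first year with country data, e.g. 1891
    match = first - 1         # matching year, e.g. 1890
    country_years = list(range(first, first + trend_years))      # e.g. 1891-1905
    region_years = list(range(match - trend_years + 1, first))   # e.g. 1876-1890
    country_value = linear_trend_value(country, country_years, match)
    region_value = linear_trend_value(region, region_years, match)
    factor = country_value / region_value
    extended = {y: v * factor for y, v in region.items() if y <= match}
    extended.update(country)  # existing country data is kept unchanged
    return dict(sorted(extended.items()))
```

Using trends instead of single-year values makes the matching less sensitive to fluctuations in the years next to the border between the two sources.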

A5.2 Numerical extrapolation
In this paper we use numerical extrapolation for extension of time series to the past on the scale of decades where historical data is not available, e.g. for land use N2O and CH4 emissions. It is also used before the gas and category aggregation process to extrapolate those time series for individual countries, gases and categories which do not have data for the latest years to 2014.
Our framework for numerical extrapolation consists of different methods for extrapolation and a wrapper that controls the results and uses a fall back option if necessary. The following options are available:
Constant Data is extrapolated with a constant value, which is computed as the mean of the n last values before the extrapolation. Constant extrapolation has no fall back option.

Linear A linear trend is computed from the last n years before extrapolation. This trend is continued for the period of extrapolation. To control the extrapolated pathway it is checked if it crosses zero (negative emissions are currently impossible for most gases and sectors and have to be excluded). If crossing is not allowed, the fall back option for this case is used. The default option is to replace all values after the crossing point by zero. If emissions are extrapolated to the past and a trend is computed which has higher emissions in the past, a fall back option is triggered as well. The default is linear to zero extrapolation.

Linear to zero A linear pathway is constructed from a starting value to zero in the last year of the extrapolation. The starting value is computed from the linear trend of the last n values. If the calculated value is below zero despite all n values being positive, we use the last value instead of the value calculated from the linear trend. There is no fall back option.
Exponential The last n years are used to fit an exponential function, which extrapolates the data. A fall back option is used if exponential fitting is not possible (e.g. when the n years contain positive as well as negative values), if too few of the n years have data available, or if during extrapolation to the past we obtain a negative exponent (i.e. emissions in the past higher than in the future).
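The constant, linear, and linear to zero options can be sketched for extrapolation to the past as follows. This is a simplified illustration with hypothetical function names, not the wrapper used for the dataset: it assumes gap-free yearly data, uses the n years next to the extrapolation period, evaluates the starting value of the linear to zero pathway in the first year with data, and omits the exponential option.

```python
def _trend(series, years):
    """Least-squares line through the given years; returns (slope, predict)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(series[y] for y in years) / n
    slope = (sum((y - mean_x) * (series[y] - mean_y) for y in years)
             / sum((y - mean_x) ** 2 for y in years))
    return slope, lambda year: mean_y + slope * (year - mean_x)


def constant_to_past(series, start_year, n=5):
    """Mean of the n years next to the gap, used for every extrapolated year."""
    first = min(series)
    mean = sum(series[y] for y in range(first, first + n)) / n
    return {y: mean for y in range(start_year, first)}


def linear_to_zero_to_past(series, start_year, n=5):
    """Straight line from the trend value in the first data year to zero in start_year."""
    first = min(series)
    years = list(range(first, first + n))
    _, predict = _trend(series, years)
    anchor = predict(first)
    if anchor < 0 and all(series[y] > 0 for y in years):
        anchor = series[first]  # trend value implausible: use the data value
    span = first - start_year
    return {y: anchor * (y - start_year) / span for y in range(start_year, first)}


def linear_to_past(series, start_year, n=5):
    """Continue the linear trend; fall back if emissions would rise into the past."""
    first = min(series)
    slope, predict = _trend(series, list(range(first, first + n)))
    if slope < 0:
        return linear_to_zero_to_past(series, start_year, n)
    # Values beyond the zero crossing are replaced by zero.
    return {y: max(predict(y), 0.0) for y in range(start_year, first)}
```

Each function returns only the extrapolated years, so the result can be merged with the original series without overwriting existing data.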
EDGAR-HYDE14 EDGAR-HYDE data uses the EDGAR v2.0 categorization, which differs from the IPCC 1996 categorization used here.
The IPCC 1996 categories we identify with the EDGAR42 categories are shown in Table 11. The summation of subcategories is done using the emissions module's aggregation framework. We do not use international bunker fuel emissions (EXX) as we do not include bunker fuels in this analysis. Data is interpolated using MATLAB's 'pchip' function.
FAO2015 Like CDIAC, FAO data explicitly models the split-up and unification of countries. Our first step is to sum and split these countries to obtain time series for the current countries and the territorial definitions used here (see Section 4.3). FAO uses different subcategories for agriculture and land use than IPCC 1996, which need to be translated to IPCC 1996 categories. For this paper the details are not relevant as we operate on aggregate agricultural and land use data.
HOUGHTON2008 The downscaling is described in Section 4.2.2. Here we add some details. The downscaling uses regional shares in cumulative deforested areas to split the regional emissions pathway to countries. In some regions there are both countries with net deforestation and net afforestation, so some countries have negative shares, which cannot be used directly for downscaling in a meaningful way. Instead we first calculate shares from only deforestation and multiply those with the regional pathway to obtain preliminary emission pathways. These pathways are then shifted such that the cumulative net emissions (or removals) equal the cumulative net emissions (or removals) calculated directly from the net deforestation shares. This approach avoids inverted growth rates for countries with net afforestation in a region with net deforestation.
Countries missing in the Houghton source are added using the regional growth rates and shares computed by the relative deforestation compared to a Houghton region with similar climate.
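The two-step downscaling of the Houghton regional pathways described above can be sketched as follows. The function name and inputs are hypothetical, and the sketch assumes that the shift is a constant offset applied to every year, which is one way to match the cumulative totals; the source does not state how the shift is distributed over time.

```python
def downscale_net_deforestation(regional, net_area):
    """Split a regional pathway using deforestation shares, then shift it.

    `regional` maps year -> regional emissions; `net_area` maps country ->
    cumulative net deforested area (negative values mean net afforestation).
    Requires at least one deforesting country and a non-zero net total.
    """
    gross_total = sum(area for area in net_area.values() if area > 0)
    net_total = sum(net_area.values())
    cumulative = sum(regional.values())
    n_years = len(regional)
    result = {}
    for country, area in net_area.items():
        gross_share = max(area, 0.0) / gross_total   # deforestation only
        net_share = area / net_total                 # may be negative
        preliminary = {y: v * gross_share for y, v in regional.items()}
        # Constant offset (assumption) so that cumulative emissions match
        # the value implied by the net deforestation share.
        shift = (net_share * cumulative - sum(preliminary.values())) / n_years
        result[country] = {y: v + shift for y, v in preliminary.items()}
    return result
```

Both sets of shares sum to one, so the offsets cancel across countries and the country pathways still add up to the regional pathway in every year.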
HYDE No preprocessing is needed.

RCP Data is first interpolated using MATLAB's 'pchip' function. For CH4 we aggregate time series to the necessary regional level. HFCs and PFCs baskets are created. For CH4 from categories 1, 2, and 4 the years 1860-1880 are removed before interpolation. The data show a steep decline to almost zero emissions from 1850 to 1860, followed by a rise to much higher values in 1890. This cannot be observed in the data presented in Lamarque et al. (2010), which is the original source of the data according to the RCP website (Meinshausen, 2011). We judge this to be an error and thus replace the values by interpolation. RCP data is published in IPCC 1996 categories, thus no mapping is needed.
SAGE No preprocessing is needed.

Appendix C: Data sources not used
In this section we describe data sources that were considered but not used in the final composite source and give the reasons why the data was not used.

C2 National communications by developed countries
National communications by developed country Parties (UNFCCC, 2014b) serve the purpose of giving information on the commitments Parties are undertaking to limit their greenhouse gas emissions and the policies implemented and planned to reach the commitments. They contain some greenhouse gas data, but the historical data does not add to CRF data, so national communications by developed country Parties are not used here.

C3 CAIT 2.0
The Climate Analysis Indicators Tool (CAIT) dataset is published by the World Resources Institute (WRI). It contains data for several [...] All sources are either included in our dataset individually (CDIAC, FAOSTAT), not publicly accessible (IEA), or only contain emissions already covered from other sources (EIA, USEPA). We do not use CAIT data, as the results are more transparent when using the original data sources directly.

C4 CDIAC CH4
This dataset has been described in Stern and Kaufmann (1995, 1996, 1998) and covers global CH4 emissions for a period from 1860 to 1994. It is created using correlations of methane emissions to socioeconomic variables or emissions of other gases for which time series are available. It is tested against emission estimates from measurements of atmospheric methane concentrations. Due to its lack of country or regional data it could only be used for extrapolation. However, we have RCP data that covers the same period and sectoral detail but has a regional resolution. We therefore do not use the CDIAC CH4 data.

C5 EIA Energy CO2
The U.S. Energy Information Administration (EIA) publishes CO2 emissions from energy consumption for most of the world's countries.
The period from 1980 to 2012 is covered. The covered sectors are consumption of coal, petroleum, and natural gas (together these correspond to IPCC 1996 category 1A) and flaring of natural gas (IPCC 1996 category 1B2C22).

C6 IEA Energy CO2
The International Energy Agency offers CO2 emissions from fuel combustion for purchase. The dataset covers 34 OECD countries and 100 non-OECD countries. As it is not publicly available we do not include it in our dataset.

C7 USEPA
The United States Environmental Protection Agency (EPA) published data for non-CO2 emissions (US Environmental Protection Agency, 2012). The dataset covers many countries and the years 1990 to 2005. It is a composite of different data sources where publicly-available country-prepared reports are prioritized. A main source for the historical data is the UNFCCC flexible query system. Annex I countries therefore use CRF data while non-Annex I countries use data from the National Communications and National Inventory Reports. However, each time series has only a few data points. We already include the individual sources used in this dataset and only little information is added.