The international political economy data resource

Quantitative scholars in international relations often draw repeatedly on the same sources of country-year data across a diverse range of projects. The International Political Economy Data Resource seeks to provide a public good to the field by standardizing and merging together 951 variables from 78 core International Political Economy data sources into a single dataset, increasing efficiency and reducing the risk of data management errors. Easier access to data encourages researchers to perform more robustness checks in their own work and replicate others’ published results more often. It also makes it easier for teachers of quantitative research methods to assign realistic exercises to their students. This resource will be updated and expanded annually. The full resource is available via the Harvard Dataverse Network, with versions also available via the Niehaus Center for Globalization and Governance at Princeton University and NewGene.

unobservable variables of interest, the number of indicator variables used in a single paper can rise into the hundreds. 1 Thus, gathering together and compiling a diverse array of variables has become a substantial part of the labor of quantitative scholarship in international relations.
Country-year data in international relations in general and international political economy (IPE) in particular are fragmented in their sources, drawn from economics, political science, sociology, international business, and other fields. Given the low readership of most journals by scholars from outside the discipline, even identifying the most appropriate existing data sources is challenging. Once the sources are identified, merging multiple datasets together requires much labor and great care, as errors at this stage can undermine all the work that follows. Scholars of international security have long utilized EUGene for a compilation of important variables. 2 The IPE Data Resource brings together a much broader range of variables, focusing on those of interest to political economy scholars broadly construed, including political scientists, development economists and macroeconomists, and scholars of international business.
Despite the diversity of topics studied, many political economy projects utilize the same standard set of core country-year data sets to measure economic, political and social conditions across countries and over time. This area of overlap has only grown as economic factors have become more central to the study of political and military outcomes and as political factors have been more fully considered in analyses of economic outcomes and business strategy. This creates an opportunity for large efficiency gains through coordination. Currently, the process of merging these core datasets together is repeated scholar by scholar, project by project. This resource aims to improve and streamline this data management process.
Increased efficiency in data management not only reduces errors and saves time, it also encourages researchers to perform their work more thoroughly. The cost of identifying, obtaining, and preparing alternative measures deters scholars from running additional robustness tests that would increase the quality of their work. All the measures we work with as quantitative social scientists are flawed in some way; use of more alternative measures reduces the odds that the results we publish are driven by these flaws. A centralized source for data is particularly valuable in an interdisciplinary topic area such as IPE, where information about new data resources can be slow to diffuse across disciplinary boundaries.
In addition to encouraging robustness checks, aggregation resources can limit p-hacking by increasing the ease of replication and introducing greater data transparency. Altering model specifications until the desired level of statistical significance is reached is a known issue in social science research (e.g., Monogan III 2013; Humphreys et al. 2013). Over the last several decades, the amount of data made publicly available has increased substantially (e.g., Coppedge et al. 2016; Cruz, Keefer, and Scartascini 2016; Marshall and Cole 2014; Fouré, Bénassy-Quéré, and Fontagné 2012), heightening the possibility of replication studies, which are critical to resolving this issue (Laitin 2013). A key contribution of this resource is to provide the data necessary for scholars to move beyond simply reproducing published results using authors' published replication data, and to allow them to more fully probe the robustness of published results by testing whether those results are sensitive to the use of alternative measures for key variables. By bringing the most commonly used variables in the IPE literature into a single resource that is regression-ready and easily accessible, the resource reduces the cost of such robustness testing. Having such a large number of variables available does, of course, make it easier for a scholar to simply download the data and run regressions until one with significant results appears. However, it also makes it easier for replication studies to expose and overturn such tenuous results.
Efficient data management is valuable in teaching as well as research. Professors teaching quantitative methods courses sometimes shy away from assigning "real world" data exploration exercises due to the difficulty of pulling together the necessary datasets for their students. Similarly, one of the motivations for this resource comes from a common assignment in graduate statistics courses: to reproduce and probe the robustness of a core finding in the literature. Having a dataset like this on hand from which students can draw additional/alternative regressors makes it easier for students to complete such a project in the span of a semester.
This project standardizes and merges together 78 commonly used datasets drawn from a range of disciplines into a single, analysis-ready country-year dataset. 3 Political institutions, economic and social conditions, and political violence are all tightly intertwined, and thus, while the content area of these data is focused on IPE, we expect this resource to be of interest to scholars doing cross-national work across the social sciences. The diversity of disciplines we hope will be served by the resource reflects the diversity of the fields represented in the data: in addition to political science and economics, the resource contains component datasets related to international business (e.g., Kingsley and Graham 2017); law (e.g., Linzer and Staton 2015); education (e.g., Barro and Lee 2013); religion (e.g., Brown and James 2015); and geography (e.g., Mayer and Zignago 2011).
The resource will be updated and expanded annually, incorporating new data sources as they become available. The boundaries of the field of IPE are inherently arbitrary; these annual updates allow additional data to be added to the resource as requested by its users. 4

Method
The IPE data resource is organized into four general topic areas: politics (37 datasets); economics (31 datasets); social conditions (8 datasets); and geography (2 datasets). See Table 1 in the Appendix for a full list. From some of the largest datasets, which contain hundreds of variables, we retain those variables we expect to be of most use to scholars working in areas related to IPE.
While we expect this data resource to be quite useful to security and civil conflict scholars, this resource includes only a limited number of measures of interstate and intrastate war. 5 We largely leave it to scholars in these areas to merge in their dependent variable(s) of interest, which may be quite varied in form.
The most critical decision regarding how to compile country-year data is the choice of unique country ID numbers. Over time, countries are born, conquered, merged and divided. Following every merger or division, a time-series cross-sectional dataset must either treat the resulting units as continuations of previous countries, in which case they retain the same ID number, or as new entities with a new number. In this project, we use the ID numbers associated with the Gleditsch-Ward system (Gleditsch and Ward 1999). However, the resource also includes Correlates of War (COW) codes, International Financial Statistics (IFS) codes, and the country codes from the World Development Indicators. 6 The Gleditsch-Ward system asserts continuation of political units through most mergers and divisions. Thus, modern Yemen, which was formed through the merger of the People's Democratic Republic of Yemen and the Arab Republic of Yemen in 1990, is treated as a continuation of the Arab Republic of Yemen. Imperial Russia, the Soviet Union, and modern Russia all retain the same number. When Estonia, Latvia, and Lithuania regained independence in 1991, they regained the same ID numbers they had prior to their loss of independence in 1940.
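The practical consequence of this choice can be illustrated with a short Python sketch. It is not drawn from the resource's own code: the records, the helper `build_panels`, and the placeholder `NEW_ENTITY_ID` are hypothetical, and the numeric code 678 is used for illustration only and should be checked against the official Gleditsch-Ward list.

```python
from collections import defaultdict

# Illustrative country-year records: (country name as it appears in a source, year).
records = [
    ("Arab Republic of Yemen", 1988),
    ("Arab Republic of Yemen", 1989),
    ("Yemen", 1990),
    ("Yemen", 1991),
]

# Continuity convention (as the text describes for Gleditsch-Ward): the merged
# state keeps the ID of the predecessor it is treated as continuing.
# The value 678 is illustrative; verify it against the official list.
continuity_ids = {"Arab Republic of Yemen": 678, "Yemen": 678}

# New-entity convention: the merged state receives a fresh ID.
# 9999 is a hypothetical placeholder, not a code from any real system.
NEW_ENTITY_ID = 9999
new_entity_ids = {"Arab Republic of Yemen": 678, "Yemen": NEW_ENTITY_ID}


def build_panels(rows, id_map):
    """Group years by country ID, giving one time series per ID."""
    panels = defaultdict(list)
    for name, year in rows:
        panels[id_map[name]].append(year)
    return {cid: sorted(years) for cid, years in panels.items()}


continuous = build_panels(records, continuity_ids)
split = build_panels(records, new_entity_ids)

print(continuous)  # a single unbroken series under one ID
print(split)       # two shorter series, broken at the 1990 merger
```

Under the continuity convention, lagged variables and country fixed effects carry across the merger; under the new-entity convention, they restart in 1990.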
This presumption of continuity is appropriate to most of the core questions in IPE, as well as the study of intrastate violence and some (though not all) comparative politics questions. 7 In most cases, the same firms that are operating the day before a merger or division of countries continue to operate after it; debts that were owed continue to be owed; contracts remain in place; social identities and divisions persist; and in many cases even the identity of political elites and the structure of many political institutions remain the same (e.g., Acemoglu and Robinson 2006; Alston 1996). Thus the Gleditsch-Ward system treats the economic and social unit (the country) as persisting as well, albeit in altered form. However, to facilitate use of this resource by security scholars, we make available a version of the dataset that is time-series-set for COW codes, which are the dominant set of country identifiers in the study of interstate war. 8 It is the COW version of the data that is integrated into the forthcoming NewGene software.
Many of the component datasets included in the IPE data resource do not include Gleditsch-Ward numbers in their native form, and no other numbering system is hegemonic. Some datasets use IFS codes, others use COW codes, etc. Thus, we append Gleditsch-Ward IDs on the basis of country names, carefully remove duplicate observations by ID and year, 9 and then proceed to merge the datasets together.
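The three steps described above (appending IDs by country name, removing duplicates by ID and year, and merging) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's Stata .do files: the crosswalk `NAME_TO_GW`, the helper functions, the variable names `var_a` and `var_b`, and all data values are hypothetical, and the ID codes shown should be verified against the official Gleditsch-Ward list. The duplicate rule follows the project's stated practice of dropping the observation with the least information; keeping the first row when two duplicates are equally informative is an assumption made here.

```python
# Hypothetical name-to-ID crosswalk; country spellings differ across sources.
NAME_TO_GW = {
    "Russia": 365,
    "Russian Federation": 365,
    "Estonia": 366,
}

ID_FIELDS = ("country", "gwno", "year")


def append_ids(rows, crosswalk):
    """Attach an ID to each row by country name; fail loudly on unknown names."""
    out = []
    for row in rows:
        name = row["country"]
        if name not in crosswalk:
            raise KeyError(f"no ID for country name: {name!r}")
        out.append({**row, "gwno": crosswalk[name]})
    return out


def information(row):
    """Count non-missing substantive values (None marks a missing value)."""
    return sum(1 for k, v in row.items() if k not in ID_FIELDS and v is not None)


def drop_duplicates(rows):
    """Keep one row per (gwno, year): the row with the most non-missing values.

    Ties keep the first row encountered (an assumption of this sketch).
    """
    best = {}
    for row in rows:
        key = (row["gwno"], row["year"])
        if key not in best or information(row) > information(best[key]):
            best[key] = row
    return best


def merge(left, right):
    """Outer-merge two deduplicated {(gwno, year): row} tables on their keys.

    A variable is simply absent from a merged row if its source has no
    observation for that country-year.
    """
    merged = {}
    for key in sorted(set(left) | set(right)):
        combined = {"gwno": key[0], "year": key[1]}
        for table in (left, right):
            for k, v in table.get(key, {}).items():
                if k not in ID_FIELDS:
                    combined[k] = v
        merged[key] = combined
    return merged


# Toy inputs with made-up values.
source_a = [
    {"country": "Russia", "year": 2000, "var_a": 6},
    {"country": "Russian Federation", "year": 2000, "var_a": None},  # duplicate, all missing
    {"country": "Estonia", "year": 2000, "var_a": 9},
]
source_b = [
    {"country": "Russian Federation", "year": 2000, "var_b": 1.0},
]

table_a = drop_duplicates(append_ids(source_a, NAME_TO_GW))
table_b = drop_duplicates(append_ids(source_b, NAME_TO_GW))
dataset = merge(table_a, table_b)
print(dataset[(365, 2000)])
```

Raising an error on an unmatched country name, rather than silently dropping the row, reflects the point made above that mistakes at the merging stage can undermine all later work.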
Data cleaning involves difficult choices, and choices optimal for one research project may be inappropriate for another. Thus, it is important for scholars using this resource to be able to track and review every step that is taken from when the raw data are downloaded until the final dataset is compiled. To this end, extensive documentation is posted along with the final dataset. The codebook for this project provides the name of each dataset, a description of each variable, citations to the sources from which the data are drawn, notes regarding any data cleaning that was necessary to integrate each component dataset into the resource, and links to the relevant codebooks whenever possible. We also post the raw component data files themselves; the Stata .do files that clean these data files and append the Gleditsch-Ward Numbers (one per raw data file); and the Stata .do file that merges all of the prepped data files together. Thus, every step in the process is transparent, and scholars can review and revise any of the data cleaning choices made.

Accessing the resource
The full version of the resource, including all code and documentation, is available via the Harvard Dataverse. 10 The resource will be updated and expanded annually. Version 1.5 of the dataset was released on August 30, 2016, and version 2.0 is scheduled for release in July 2017. Newly released versions will be posted first to Dataverse; Dataverse automatically archives all past versions. 11 The archived versions are critical, as they allow scholars to replicate published studies using the same version of the data used in the original study.
In order to maximize accessibility and serve diverse user communities, we also make the resource available via two other mechanisms. First, an interactive interface hosted by the Niehaus Center for Globalization and Governance at Princeton University allows users to download subsets of variables, countries, and years. 12 Second, to aid security scholars in particular, version 1.5 of the IPE Data Resource is scheduled for inclusion in version 1.0 of NewGene, the successor to EUGene, a software program facilitating the creation of dyadic and directed-dyadic datasets for the study of interstate war (Bennett, Poast and Stam 2015). 13 While both of these alternative sites provide the codebook to the version of the resource posted, the code used in the standardization and merging process and the archive of outdated versions are only provided via Dataverse.
9 In this process, we remove the observation with the least information. In most cases at least one of the duplicate observations contains exclusively missing values.
10 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X093TV
11 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/X093TV Versions 1.0-1.4 are not on Dataverse and are available from the authors upon request. Other sites that host interactive versions of the resource may experience a lag in posting updated versions of the resource after they become available via Dataverse.
12 http://ncgg.princeton.edu/irdataverse.php

Existing resources
A number of scholars and institutions have taken the step of providing catalogs of links to frequently used datasets. For example, Paul Hensel maintains the International Relations Data Site, which is essentially a catalog of links to individual datasets created by various scholars. 14 The Quality of Governance Institute (QoG) provides the resource most similar to that which we offer here, but with more of a comparative politics focus than an IPE focus. Thus, the largest difference between these two resources is in the substantive topics they cover: 106 datasets are compiled by QoG and 78 datasets here, of which only 28 overlap. 15 However, these resources also differ in their approach to country codes. The QoG (rightly) takes an approach to country codes that is tailored to the study of domestic institutions and processes. Every time countries merge or divide in the QoG data, the resulting units are given new country IDs. Thus, in the QoG data, a unified Yemen is treated as a continuation of neither South Yemen nor North Yemen, but rather as an entirely new entity. This is appropriate in many comparative politics contexts because the nature of domestic politics has been fundamentally altered, but as discussed in the method section, it is less ideal for the study of international relations. Thus, the IPE data resource uses Gleditsch-Ward and COW IDs instead.
The IPE Data Resource also has similarities to the data management components of the Expected Utility Generation and Data Management Program (EUGene) developed by Bennett and Stam (2000), the World Development Indicators (WDI), which are compiled by the World Bank, the GROWup project created by Girardin et al. (2015), and the PRIO-GRID dataset (Tollefsen, Strand, and Buhaug 2012). EUGene and its forthcoming successor, NewGene, cover a range of variables often used in models of interstate conflict, while WDI covers measures of economic and human development, largely eschewing measures of political institutions. Thus, a key distinction is again the substantive area of focus of each data aggregation effort. PRIO-GRID covers some of the same economic, social, and political variables as this project, but it organizes them by a different set of geographic units: quadratic grid cells rather than countries.
These data resources are complements rather than substitutes for one another. Just as NewGene incorporates a version of the IPE Data Resource into its interface, our dataset includes variables drawn from the WDI. Each resource serves a related, but distinct need of the quantitative social science community.

A note on citation
The component datasets compiled in this project were each constructed, through great cost and effort, by scholars and institutions that deserve credit for their work. 16 The codebook to the IPE Data Resource provides the appropriate citation for each component dataset. Any publication that draws on data from this resource must cite both the original datasets from which the variables in question are drawn and this paper, with reference to the specific version of the IPE Data Resource utilized. This allows precise replication of any results produced using the resource and makes clear the source of any data-related errors.

Conclusion
This data resource serves the field by bringing together 78 datasets from political science, economics, international business, and sociology and carefully merging them together in a theoretically informed manner. The resource can be downloaded in its entirety, or subsets of variables, countries, and years can be selected. Datasets are merged together based on the Gleditsch-Ward system of country IDs, but an alternate version of the dataset employing the COW numbering system is also available. Because the project has the institutional support necessary to sustain annual updates, it will be able to remain relevant and up to date, updating existing data and incorporating new component datasets as they become available. A dyadic version of the resource represents the next logical step in this resource creation agenda. A dyad-year resource that includes information on bilateral trade and investment flows, joint membership in international organizations, and other dyadic information is under construction as a complement to the country-year resource described here.
The value of this project comes in the form of increased efficiency, increased research and teaching quality, and the diffusion of new data resources across disciplinary boundaries. As the costs of doing high quality work fall, we can all do better work and more of it.