The International Political Economy Data Resource

Quantitative scholars in international relations often draw repeatedly on the same sources of country-year data across a diverse range of projects. The IPE Data Resource seeks to provide a public good to the field by standardizing and merging together 951 variables from 78 core IPE data sources into a single dataset, increasing efficiency and reducing the risk of data management errors. Easier access to data both encourages researchers to perform more robustness checks than they otherwise might and makes it easier for teachers of quantitative research methods to assign realistic exercises to their students. This resource will be updated and expanded annually and is available via the Harvard Dataverse Network.


Introduction
Cross-national quantitative research in international relations is a data-intensive undertaking. Many researchers create novel measures of the constructs most critical to the question at hand, i.e., their variable(s) of interest. However, between control variables in the core model and a range of alternative measures employed as robustness checks, it is not uncommon for a single paper to utilize thirty or forty variables drawn from a dozen distinct datasets. In papers that use latent variable models to estimate unobservable variables of interest, the number of indicator variables used in a single paper can rise into the hundreds. 2 Thus, gathering and compiling a diverse array of variables has become a substantial part of the labor of quantitative scholarship in international relations.
Country-year data in international relations in general and international political economy (IPE) in particular is fragmented in its sources, drawn from economics, political science, sociology, international business, and other fields. Given the low readership of most journals by scholars from outside the discipline, even identifying the most appropriate existing data sources is challenging. Once the sources are identified, merging multiple datasets requires much labor and great care, as errors at this stage can undermine all the work that follows. Scholars of international security have long utilized EUGene for a compilation of important variables. 3 This project brings together a much broader range of variables, focusing primarily on those of interest to IPE scholars.

2 See, for example, Fariss (2014) assessing government respect for human rights, Hollyer, Rosendorff, and Vreeland (2014), or Fariss and Graham (2016) measuring the security of property rights.
3 EUGene refers to the Expected Utility Generation and Data Management Program (Bennett and Stam 2000).
Despite the diversity of topics studied, many projects in IPE, international security, and comparative politics employ the same standard set of core country-year data sets to measure economic, political and social conditions across countries and over time. This area of overlap has only grown as economic factors have become more central to the study of international security. This creates an opportunity for large efficiency gains through coordination. Currently, the process of merging these core datasets together is repeated scholar by scholar, project by project. This resource aims to improve and streamline this data management process.
Increased efficiency in data management not only reduces errors and saves time, it also encourages researchers to perform their work more thoroughly. The cost of identifying, obtaining, and preparing additional measures deters scholars from running additional robustness tests that would increase the quality of their work. All of the measures we work with as quantitative social scientists are flawed in some way; use of more alternative measures reduces the odds that the results we publish are driven by these flaws. A centralized source for data is particularly valuable in an interdisciplinary topic area such as IPE, where information about new data resources can be slow to flow across disciplinary boundaries.
Efficient data management is valuable in teaching as well as research. Professors teaching quantitative methods courses sometimes shy away from assigning "real world" data exploration exercises due to the difficulty of pulling together the necessary datasets for their students. Similarly, one of the motivations for this resource comes from a common assignment in graduate statistics courses: to reproduce and probe the robustness of a core finding in the literature. Having a dataset like this on hand, from which students can draw additional or alternative regressors, makes it easier for students to complete such a project in the span of a semester.
This project compiles 78 commonly used datasets drawn from a range of disciplines into an easy-to-merge format designed to speed the process of acquiring and preparing the necessary data for quantitative research projects. 4 Political institutions, economic conditions, and political violence are all tightly intertwined, and thus, while the content area of these data is focused on IPE, we expect this resource to be of interest to scholars in a variety of related fields, including security and comparative politics. To aid security scholars in particular, version 1.5 of the IPE Data Resource is scheduled for inclusion in version 1.0 of NewGene, the successor to EUGene, a software program facilitating the creation of dyadic and directed-dyadic datasets for the study of interstate war (Poast and Stam 2015). 5
We merge component datasets together on the basis of the Gleditsch-Ward country identification system, which we argue should become the standard in IPE research (Gleditsch and Ward 1999). However, the completed dataset also includes Correlates of War (COW) codes and International Financial Statistics (IFS) codes. Seventy-eight datasets fall far short of comprehensive coverage of all important IPE datasets, but they are a substantial start. Critically, the Security and Political Economy (SPEC) Lab at the University of Southern California provides institutional support for the resource. 6 The dataset will be updated and expanded annually, incorporating new versions of datasets already included and adding new relevant datasets as they become available. 7

4 This count, and all details in this article, refer to version 1.5 of the data.
5 Software available at http://newgenesoftware.org [Release Pending]
6 Financial support for the lab is provided by the Dornsife College of Arts and Letters, the Center for International Studies, and the School of International Relations. Support is renewable indefinitely.

Methodology
The IPE data resource incorporates component datasets covering four general topic areas: politics (37 datasets); economics (31 datasets); social conditions (8 datasets); and geography (2 datasets). See Table A1 in the Appendix for a full list. While we expect this data resource to be quite useful to security and civil conflict scholars, this resource includes only a limited number of measures of interstate and intrastate war. 8 We largely leave it to scholars in these areas to merge in their dependent variable(s) of interest, which may be quite varied in form.
The most critical decision regarding how to compile country-year data is the choice of unique country ID numbers. Over time, countries are born, conquered, merged and divided. Following every merger or division, a time-series-cross-sectional dataset must either treat the resulting units as continuations of previous countries, in which case they retain the same ID number, or as new entities with a new number. In this project, we use the ID numbers associated with the Gleditsch-Ward system (Gleditsch and Ward 1999).
The Gleditsch-Ward system asserts continuation of political units through most mergers and divisions. Thus, modern Yemen, which was formed through the merger of the People's Democratic Republic of Yemen and the Arab Republic of Yemen in 1990, is treated as a continuation of the Arab Republic of Yemen. Imperial Russia, the Soviet Union, and modern Russia all retain the same number. When Estonia, Latvia, and Lithuania regained independence in 1991, they regained the same ID numbers they had prior to their loss of independence in 1940.

7 New data will be downloaded in January and a revised version of the resource posted in March of each year.
8 The one exception is the Major Episodes of Political Violence Data (Marshall 1999), which includes economic damage within its index of conflict severity. We view this as a conflict measure particularly appropriate for use as an independent variable in analyses of economic outcomes.
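The continuity rule can be illustrated with a minimal Python sketch. It is not the authors' Stata code: the numeric IDs, the lookup table `CONTINUITY_IDS`, and the helpers `assign_id` and `build_panel` are hypothetical placeholders chosen for illustration, not actual Gleditsch-Ward codes.

```python
# Hypothetical placeholder IDs; these are NOT actual Gleditsch-Ward codes.
CONTINUITY_IDS = {
    "Arab Republic of Yemen": 1,
    "People's Democratic Republic of Yemen": 2,
    "Yemen": 1,            # unified Yemen continues the Arab Republic of Yemen
    "Imperial Russia": 3,
    "Soviet Union": 3,     # same unit as Imperial Russia and modern Russia
    "Russia": 3,
}


def assign_id(country_name):
    """Return the unit ID for a country name, or None if the name is unknown."""
    return CONTINUITY_IDS.get(country_name)


def build_panel(rows):
    """Attach IDs to (name, year) rows and collect the years observed per ID."""
    panel = {}
    for name, year in rows:
        unit = assign_id(name)
        if unit is None:
            raise KeyError(f"no ID for {name!r}")
        panel.setdefault(unit, []).append(year)
    return {unit: sorted(years) for unit, years in panel.items()}


rows = [
    ("Arab Republic of Yemen", 1989),
    ("People's Democratic Republic of Yemen", 1989),
    ("Yemen", 1990),
    ("Yemen", 1991),
]
panel = build_panel(rows)
```

Under this scheme the time series for unit 1 runs unbroken across the 1990 merger, while unit 2 simply ends. A scheme that assigns a new ID after every merger would instead start a third series in 1990.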
This presumption of continuity is appropriate to most of the core questions in IPE, as well as the study of intrastate violence and some (though not all) comparative politics questions. 9 In most cases, the same firms that are operating the day before a merger or division of countries continue to operate after it; debts that were owed continue to be owed; contracts remain in place; social identities and divisions persist; and in many cases even the identity of political elites and the structure of many political institutions remain the same (e.g. Acemoglu and Robinson 2006; Alston 2008). Thus the Gleditsch-Ward system treats the economic and social unit (the country) as persisting as well, albeit in altered form. However, to facilitate use of this resource by security scholars, we make available a version of the dataset that is time-series-set for COW codes, which are the dominant set of country identifiers in the study of interstate war. 10 It is this version of the data that is integrated into the forthcoming NewGene software.

9 It is notable that the Uppsala/PRIO civil conflict datasets are all based on Gleditsch-Ward numbers.
10 Time series setting refers to the process of establishing unit and time variables within Stata to identify unique time-series-cross-sectional observations. The primary version of this resource contains some observations that are duplicates by COW code and year. In particular, some component datasets include data for both Serbia (in isolation) and Yugoslavia (in total) in the same years. We can retain this information in the Gleditsch-Ward-based version of the data, where modern Serbia is treated as a continuation of the Kingdom of Serbia and as distinct from Yugoslavia, but must drop this information from the COW version, where Yugoslavia and Serbia share the same country code.
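The requirement that each unit-year pair be unique can be checked with a short Python sketch. This is an illustration rather than the authors' procedure: the field names `gw` and `cow`, the helper `duplicate_keys`, the numeric codes, and the years are all hypothetical placeholders, not real identifiers or observations.

```python
from collections import Counter


def duplicate_keys(rows, id_field):
    """Return the (id, year) pairs that occur more than once under id_field."""
    counts = Counter((row[id_field], row["year"]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# Placeholder codes: the two units have distinct "gw" IDs but share one "cow" ID.
rows = [
    {"name": "Serbia", "gw": 10, "cow": 20, "year": 1995},
    {"name": "Yugoslavia", "gw": 11, "cow": 20, "year": 1995},
    {"name": "Serbia", "gw": 10, "cow": 20, "year": 1996},
]

gw_dups = duplicate_keys(rows, "gw")    # no duplicates: panel is well defined
cow_dups = duplicate_keys(rows, "cow")  # one duplicated unit-year pair
```

An empty result means the data can be declared as a panel on that identifier. A non-empty result lists the unit-years where one observation would have to be dropped first.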

Many of the component datasets included in the IPE data resource do not include Gleditsch-Ward numbers in their native form, and no other numbering system is hegemonic. Some datasets use IFS codes, others use COW codes, etc. Thus, we append Gleditsch-Ward IDs on the basis of country names, carefully remove duplicate observations by ID and year, 11 and then proceed to merge the datasets together.
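The three steps (append IDs by country name, remove duplicates by ID and year, merge) can be sketched in Python as follows. This is a minimal illustration, not a translation of the project's Stata .do files. The country names, ID values, variable names, and helper functions are hypothetical. The duplicate rule follows the paper's note that the observation with the least information is removed, and the sketch assumes that each component dataset uses distinct variable names.

```python
def append_ids(rows, name_to_id):
    """Attach a numeric ID to each row by country name; unknown names raise KeyError."""
    out = []
    for row in rows:
        new_row = dict(row)
        new_row["gwno"] = name_to_id[row["country"]]
        del new_row["country"]
        out.append(new_row)
    return out


def count_non_missing(row, key_fields=("gwno", "year")):
    """Count the non-missing data values in a row, ignoring the key fields."""
    return sum(1 for k, v in row.items() if k not in key_fields and v is not None)


def drop_duplicates(rows):
    """For each (gwno, year), keep the row with the most non-missing values."""
    best = {}
    for row in rows:
        key = (row["gwno"], row["year"])
        if key not in best or count_non_missing(row) > count_non_missing(best[key]):
            best[key] = row
    return best


def merge(*datasets):
    """Outer-merge deduplicated datasets on (gwno, year)."""
    merged = {}
    for data in datasets:
        for key, row in drop_duplicates(data).items():
            merged.setdefault(key, {}).update(row)
    return merged


# Fictional countries and values, for illustration only.
name_to_id = {"Aland": 1, "Bolor": 2}
politics = [
    {"country": "Aland", "year": 2000, "democracy": 7},
    {"country": "Aland", "year": 2000, "democracy": None},  # empty duplicate
    {"country": "Bolor", "year": 2000, "democracy": 3},
]
economics = [{"country": "Aland", "year": 2000, "gdp": 1.5}]

merged = merge(append_ids(politics, name_to_id), append_ids(economics, name_to_id))
```

Failing loudly on an unmatched country name is a deliberate choice in this sketch: a silent mismatch at the ID stage would propagate into every later merge.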
Data cleaning involves difficult choices, and choices optimal for one research project may be inappropriate for another. Thus, it is important for scholars using this resource to be able to track and review every step that is taken from when the raw data is downloaded until the final dataset is compiled. To this end, extensive documentation is posted along with the final dataset. The codebook for this project provides the name of each dataset, a description of each variable, citations to the sources from which the data is drawn, notes regarding any data cleaning that was necessary to integrate each component dataset into the resource, and links to the relevant codebooks whenever possible. We also post the raw component data files themselves; the Stata .do files that clean these data files and append the Gleditsch-Ward Numbers (one per raw data file); and the Stata .do file that merges all of the prepped data files together. Thus, every step in the process is transparent, and scholars can review and revise any of the data cleaning choices made.
Version 1.5 of the dataset was released on August 30, 2016. All materials are available via the Harvard Dataverse Network. 12 This resource will be updated and expanded annually.

11 In this process, we remove the observation with the least information. In most cases at least one of the duplicate observations contains exclusively missing values.
12 http://dx.doi.org/10.7910/DVN/X093TV

An Illustrative Example
In 2003, International Organization featured two quantitative articles, one by Nathan Jensen and one by Quan Li and Adam Resnick, both of which used country-year data to assess the relationship between domestic political institutions and inflows of foreign direct investment (FDI). 13 In the ensuing decade, dozens more quantitative studies followed examining various political determinants of FDI inflows. The papers in this literature overlap heavily in terms of the theoretical constructs they seek to measure (e.g. democracy, government capacity, political risk), but they differ in terms of precisely how these theoretical constructs are defined and in how they are measured. While many studies employ more than one measure of key institutional variables, we argue that, across this literature, scholars do not employ alternate measures of key constructs nearly as much as they should, nor nearly as much as they might if alternative measures were more readily available.
To make this case, we evaluate 21 of the most prominent articles matching two conditions: 1) they use country-year data to predict FDI inflows; and 2) they were published between 2003 and 2013. 14 Table 1 reports the different aspects of political institutions that are evaluated in these papers, the specific measures used to capture these institutions, and the papers that employ each measure.

Table 1 (excerpt): measures and the papers that employ them
Number of Preferential Trade Agreements: Büthe and Milner (2008)
GATT/WTO membership: Büthe and Milner (2008)
Bilateral Investment Treaties: Too numerous to list
IMF Agreements: Jensen (2004)

[...] veto players (checks) in the Database of Political Institutions (Beck et al. 2001); and the political constraints (polcon) measure designed by Witold Henisz. 36 Though they measure the same underlying theoretical construct, each is coded differently, making each appropriate in different circumstances. Each of the three measures hails from a different discipline: xconst was created by political scientists; checks by economists; and polcon by a business scholar. Thus, many scholars are not even aware of all three. The pairwise correlations between them range from .64 to .79, correlations that are strong, but not sufficiently so to render the choice between them moot. Similarly, Figure 2 shows substantial differences in distribution across these measures. Both polcon and checks feature a large cluster of countries in which they assess the chief executive to be entirely unconstrained: 43% of the observations for checks and 39% of the observations for polcon. Conversely, only 17% of observations receive the lowest score on the seven-point xconst scale; 31% receive the top score.
In light of these important differences, no scholar should use any of these measures, even as a control variable, without fully understanding the substantive differences among them and assessing, at least for their own information, the robustness of their findings to each of the three. However, many scholars fail to test the robustness of their results to alternative measures of key variables simply because of the time required to track down, prepare, and merge together the necessary data. The more readily available alternative measures are, the higher the quality of the analysis scholars are likely to perform.

36 Polcon is omitted from version 1.5 of the resource because Henisz could not be reached to secure his permission. If permission is granted, the Polcon data will be added to the next update of the resource.
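Once alternative measures sit in one merged dataset, comparing them takes only a few lines. The Python sketch below computes pairwise Pearson correlations over observations where both measures are present. The values are made-up toy numbers, not data from the resource, so the resulting correlations say nothing about the real measures; the helper `pearson` and the dictionary layout are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation over positions where both values are non-missing."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


# Toy values only; None marks a missing observation.
measures = {
    "xconst": [1, 3, 5, 7, 7, None],
    "checks": [1, 1, 2, 4, 5, 3],
    "polcon": [0.0, 0.0, 0.2, 0.4, 0.5, 0.3],
}

correlations = {
    (a, b): pearson(measures[a], measures[b])
    for a, b in combinations(measures, 2)
}
```

The same loop over measure names could re-estimate a model once per alternative measure, which is the kind of robustness check the text recommends.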
A fundamental contribution of the IPE data resource is to make a wide range of alternate measures easily available to scholars. The ambition of the resource is to include all or most of the variables used in statistical models of the core outcomes in international political economy, from currency crises to trade policy to human rights. With 951 variables drawn from 78 component data sources, version 1.5 already places us quite far along that path.

Existing Resources
A number of scholars and institutions have taken the step of providing catalogs of links to frequently used datasets. For example, Paul Hensel maintains the International Relations Data Site, which is essentially a catalog of links to individual datasets created by various scholars. 37 The Quality of Governance Institute (QoG) provides the resource most similar to that which we offer here, but with more of a comparative politics focus than an IPE focus (Teorell et al. 2016). Thus, the largest difference between these two resources is in the substantive topics they cover: 106 datasets are compiled by QoG and 78 datasets here, of which only 28 overlap. 38 However, these resources also differ in their approach to country codes. The QoG (rightly) takes an approach to country codes that is tailored to the study of domestic institutions and processes. Every time countries merge or divide in the QoG data, the resulting units are given new country IDs. Thus, in the QoG data, a unified Yemen is treated as a continuation of neither South Yemen nor North Yemen, but rather as an entirely new entity. This is appropriate in many comparative politics contexts because the nature of domestic politics has been fundamentally altered, but as discussed in the methodology section, it is less ideal for IPE. Thus, the IPE data resource uses Gleditsch-Ward and Correlates of War IDs instead.
The IPE Data Resource also has similarities to the data management components of the Expected Utility Generation and Data Management Program (EUGene) developed by Bennett and Stam (2000), to the World Development Indicators (WDI), which are compiled by the World Bank, to the GROWup project created by Girardin et al. (2015), and to the PRIO-GRID dataset (Tollefsen, Strand, and Buhaug 2012). EUGene and its forthcoming successor, NewGene, cover a range of variables often used in models of interstate conflict, while WDI covers measures of economic and human development, largely eschewing measures of political institutions. Thus a key distinction is again the substantive area of focus of each data aggregation effort. PRIO-GRID brings together a range of economic, social, and political information on the basis of geographic space, rather than country boundaries. PRIO-GRID covers some of the same variables as this project, but it brings them together with regard to a different set of geographic units: quadratic grid cells rather than countries.

37 http://www.paulhensel.org/data.html
38 This count was produced using the January 2016 version of the QoG Standard Data.
While the IPE Data Resource is a stand-alone resource, its coming integration with NewGene offers an opportunity to advance the integration of data resources between security and IPE. The most obvious initial benefit of this integration is to make political economy data easily available in appropriate formats to scholars of interstate conflict. In turn, building dyadic and directed-dyadic economic information (e.g. trade, investment, and migration flows) into future generations of NewGene has the potential to facilitate the use of detailed security data (e.g. disputes and alliances) by IPE scholars.

A Note on Citation
The component datasets compiled in this project were each constructed, through great cost and effort, by scholars and institutions that deserve credit for their work. 39 The codebook to the IPE Data Resource provides the appropriate citation for each component dataset. Any publication that draws on data from this resource must cite the original datasets from which the variables in question are drawn. Citations to this paper (i.e. to the IPE Data Resource itself) are necessary only if the aggregation and data management efforts represented by the resource contribute to the research produced.

39 Permission to repost data was obtained from the authors of each component dataset.

Conclusion
This data resource serves the field by bringing together 78 datasets from political science, economics, international business, and sociology and carefully merging them together in a theoretically informed manner. Datasets are merged together on the basis of the Gleditsch-Ward system of country IDs, but an alternate version of the dataset employing the COW numbering system is also available. Because the project has the institutional support necessary to sustain annual updates, it will be able to remain relevant and up to date, updating existing data and incorporating new component datasets as they become available. The value of this project comes in the form of increased efficiency, increased research and teaching quality, and the diffusion of new data resources across disciplinary boundaries. As the costs of doing high quality work fall, we can all do better work and more of it.