Measuring transparency in the social sciences: political science and international relations

The scientific method is predicated on transparency—yet the pace at which transparent research practices are being adopted by the scientific community is slow. The replication crisis in psychology showed that published findings employing statistical inference are threatened by undetected errors, data manipulation and data falsification. To mitigate these problems and bolster research credibility, open data and preregistration practices have gained traction in the natural and social sciences. However, the extent of their adoption in different disciplines is unknown. We introduce computational procedures to identify the transparency of a research field using large-scale text analysis and machine learning classifiers. Using political science and international relations as an illustrative case, we examine 93 931 articles across the top 160 political science and international relations journals between 2010 and 2021. We find that approximately 21% of all statistical inference papers have open data and 5% of all experiments are preregistered. Despite this shortfall, the example of leading journals in the field shows that change is feasible and can be effected quickly.


Introduction
The Royal Society has as its motto the injunction Nullius in verba: 'Take nobody's word for it'. Yet a large portion of published studies in the social sciences demand of the reader exactly this.
Over the past several decades, open science advocates have called for the routinization of open science practices such as posting data and code upon a paper's publication and the preregistration of experiments [1]. Beginning principally in

The need for open data

Uncovering data errors and misinterpretation
Errors in data, or the misreporting of p values or test statistics, inevitably occur in research and can go undetected by an article's authors or peer reviewers. Once corrected, these problems may substantively alter an article's conclusions or produce null rather than positive results. Reporting errors in regression coefficients or test statistics occur frequently [27].
Access to the original data can help determine whether errors are trivial, and contribute to retraction efforts if they are not [28]. In some cases, access to the data allows for detailed concerns with a paper's analysis to be illustrated without the journal believing a retraction is warranted [29,30].

Identifying model misspecification and facilitating extension
Researchers have tremendous flexibility in deciding how to collect data and which statistical models should be specified to analyse them. Gelman & Loken termed this process the 'garden of forking paths' [31]: some set of decisions might yield a positive result, while another set of equally justifiable decisions might lead to a null result. The replication crisis has shown that it is a mistake to view a single study or set of statistical analyses as a definitive answer to a given theory or claim; the scientific process should instead be iterative, exploratory and cumulative [32].
Open data can address the problem of model misspecification and uncertainty around modelling the data-generating process [33]. Since researchers cannot anticipate changes to methodological best practices, computational reproducibility materials allow scholars to make adjustments when best practices change. (For instance, Lenz & Sahn [34] find that over 30% of observational studies published in the American Journal of Political Science rely on suppression effects to achieve statistical significance; determining the influence of suppression effects requires access to the original data.) Even if misspecification is not a problem, extending and building off of the original analyses, whether to run more theoretically motivated models, conduct sensitivity analyses or assess treatment heterogeneity, are net positives for science [35].

Exposing data falsification
In the most egregious cases, open data allow researchers to investigate and expose data falsification. High-profile exposures of data falsification include the LaCour & Green [36] case in political science and the Shu et al. [37] case in psychology [38,39]. Both rested on investigator access to the original data. While data falsification is presumably exceedingly rare, there is no way to know its extent given the general absence of computational reproducibility materials in the first place.

The need for preregistration

Distinguishing confirmatory from exploratory analysis
Preregistration means that researchers specify their hypotheses, measurements and analytic plans prior to running an experiment. This commits researchers to making theoretical predictions before they can view the data and be influenced by observing the outcomes [2,40]. By temporally separating predictions from the data that test their accuracy, there is much less flexibility for both post hoc theorizing and alterations of statistical tests to fit the prediction.
Post hoc theorizing, also known as hypothesizing after the results are known (HARKing), is an example of circular logic: the researcher conducts many tests when exploring a dataset, the data reveal a relationship that can be made into a hypothesis, and that hypothesis is 'tested' on the data that generated it [41]. But the diagnosticity of a p-value is in part predicated on knowing how many tests were performed: when an exploratory finding is reported as a prediction, the normal methods employed to evaluate the validity of a hypothesis, such as whether the p-value is <0.05 (i.e. null hypothesis significance testing), no longer hold. The p values in that case have unknown diagnosticity [41]. Thus, post hoc theorizing and selective reporting greatly contribute to false positives.
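The inflation of false positives under multiple testing is easy to make concrete. The short R simulation below is ours, purely illustrative and not part of the original analysis: it draws repeatedly from null data and shows how quickly the chance of obtaining at least one p-value below 0.05 grows when a researcher runs several tests but reports only the 'best' one.

```r
# Illustrative simulation: k independent tests on data with no true effect,
# reporting 'a result' whenever any single test reaches p < 0.05.
set.seed(1)

false_positive_rate <- function(k, n = 50, reps = 2000) {
  mean(replicate(reps, {
    p_values <- replicate(k, t.test(rnorm(n), rnorm(n))$p.value)
    any(p_values < 0.05)
  }))
}

sapply(c(1, 5, 10, 20), false_positive_rate)
# roughly 0.05, 0.23, 0.40, 0.64 -- close to the analytic value 1 - 0.95^k
```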

Reducing the selective reporting of results
The selective reporting of statistical tests and results can occur for a variety of reasons. There are numerous legitimate ways of analysing data, and this makes selective reporting seem justifiable. Danger arises when researchers convince themselves that the measures and tests lending evidence to their claims are the 'right' ones, while unjustifiably failing to report measures and tests that did not support the favoured hypothesis.
Selectively reported experimental studies often result in overconfident theoretical claims and inflated effect sizes when compared to replications. The Open Science Collaboration [3] and Many Labs studies [42,43] have shown that the effect sizes in highly powered replications are much smaller than those in the original studies. When reforms mandating preregistration are implemented, whether by research bodies or through formats like registered reports, the number of null results reported rises [44,45].
The primary purpose of preregistration is to provide journal reviewers and readers the ability to transparently evaluate predictions and the degree of flexibility researchers had to arrive at their conclusions [46-48]. It is up to the reader to determine whether preregistered studies followed their preregistration plan and adequately justified deviations; insufficiently detailed preregistration reports are an ongoing problem [49].
The replication crisis has altered best practices and changed the habits of many researchers in the behavioural sciences. As we show below, preregistration is not yet the norm in political science and international relations. The conclusions from many studies relying on statistical inference, even some that have been preregistered on a registry, remain exposed to the statistical pitfalls described above.

Methods
Our study design called for a comprehensive analysis of population-level data, yet our populations, (i) papers using data and statistics and (ii) original experiments, were embedded in a larger population of all political science and international relations publications in target journals. We downloaded all of the journals' papers from 2010 to 2021. Once we had these papers locally, we identified the data, statistical and experimental papers through dictionary-based feature engineering and machine learning. We then used public APIs, web scraping and text analysis to identify which of the studies had computational reproducibility materials. We outline this process below.

Phase 1: gathering and classifying the papers
We used Clarivate's 2021 Journal Citation Report to identify target journals. We filtered for the top 100 journals in both political science and international relations, and combined the two lists for a total of 176 journals. With this list, we used the Crossref API to download all publication metadata. We were able to obtain records for 162 journals. This resulted in over 445 000 papers, which we then filtered on Crossref's published.print field for 2010 and onwards, resulting in 109 553 papers. We used the published.print field as it was the only reliable indicator of the actual publication date, and the most complete. As of mid-2023, we were able to obtain 93 931 of these articles in PDF and HTML formats, and we use this as the denominator in the study.
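As an illustration of this harvesting step, the sketch below retrieves one page of Crossref records for a single journal and filters on the print publication date. The ISSN, page size and the use of the httr and jsonlite packages are assumptions made for the example; the paper does not report its exact query parameters or client code.

```r
library(httr)
library(jsonlite)

# One page of records for one journal; the ISSN below is illustrative only.
issn <- "0092-5853"
res  <- GET(
  paste0("https://api.crossref.org/journals/", issn, "/works"),
  query = list(
    filter = "from-print-pub-date:2010-01-01",  # Crossref's print publication date filter
    rows   = 1000,                              # maximum page size
    cursor = "*"                                # start deep paging
  )
)
page  <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
items <- page$message$items                     # DOI, title, published-print, etc.
# Harvesting a full journal means looping, passing page$message$`next-cursor`
# back as the cursor until no further items are returned.
```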
We converted the PDFs to plaintext using the open-source command line utility pdftotext, and we converted the HTML files to text using the html2text utility. Identifying the papers that relied on data, statistical analysis and experiments was an iterative process. In each case, we read target papers and devised a dictionary of terms meant to uniquely identify others like them. We iteratively revised these dictionaries to arrive at terms that seemed to maximize discrimination for target reports. The dictionaries eventually comprised 52, 180 and 133 strings, symbols or regular expressions for the three categories, respectively. The dictionaries were then used with custom functions to create document feature matrices (DFMs), where each paper is an observation, each column a dictionary term and each cell a count of that term. The DFM format made the papers amenable to large-scale analysis. In machine learning parlance, this process is known as feature engineering.
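A minimal sketch of this dictionary-count step is given below using the quanteda package. The study describes custom functions rather than a specific library, so the package choice, the toy dictionary terms and the toy documents here are all illustrative assumptions; the logic (one column per dictionary term, one count per paper) follows the description above.

```r
library(quanteda)

# Illustrative only: a handful of terms and two toy 'papers'; the study's
# dictionaries held 52, 180 and 133 entries and were applied to the full corpus.
stat_terms <- c("ordinary least squares", "regression", "standard error",
                "confidence interval", "control group")
stat_dict  <- dictionary(setNames(as.list(stat_terms), stat_terms))  # one key per term

papers <- c(doc_a = "We estimate an ordinary least squares regression with controls.",
            doc_b = "Interviews were conducted with twelve senior diplomats.")

toks      <- tokens(papers, remove_punct = TRUE)
dfm_terms <- dfm(tokens_lookup(toks, stat_dict, valuetype = "fixed"))
convert(dfm_terms, to = "data.frame")  # rows = papers, columns = counts of each term
```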
For the first research question, examining the presence of code and data in papers involving statistical inference, we hand-coded a total of 1624 papers with Boolean categories and identified 585 that relied on statistical inference. We defined statistical inference papers as any that involved mathematical modelling of data. The corresponding dictionary includes terms associated with the specification of statistical models, such as ordinary least squares, regression and control groups or variables. This definition is meant to capture a simple idea: mathematical modelling requires computer instructions that perform functions on numbers. In the absence of computational reproducibility materials, these transformations cannot be exactly reproduced by readers. We also developed a dictionary of 35 terms for formal theory papers, because we wished to exclude papers that did not apply a model to real-world data.
For the second question, examining what proportion of experiments were preregistered, we hand-coded 518 papers with a single Boolean category: whether the paper reported one or more original experiments. We defined this as any article containing an experiment where the researchers had control over treatment.
We then trained two machine learning models, the support vector machine (SVM) and naive Bayes (NB) binary classifiers, to arrive at estimates for the total number of statistical inference and experimental papers. SVMs are pattern recognition algorithms that give binary classifications to variables in high-dimensional feature space by finding the optimal separating boundary between labelled training data [50,51]. The NB family of algorithms calculates the posterior probability of a given classified input based on the independent probability of all the values of its features; the trained algorithm is then applied to classify new inputs [52].
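The following sketch shows how such classifiers could be trained and evaluated in R with the e1071 package. The object names (`features`, `is_stat`), the 80/20 split and the linear kernel are assumptions made for illustration; the paper does not state which implementation or hyperparameters were used (tuning details are reported in appendix A).

```r
library(e1071)

# Minimal sketch, assuming `features` is the document-feature matrix (rows = papers,
# columns = dictionary term counts) and `is_stat` the hand-coded labels; both names
# are placeholders, not objects from the study's code.
set.seed(1)
y         <- factor(is_stat)
train_idx <- sample(nrow(features), floor(0.8 * nrow(features)))

svm_fit <- svm(features[train_idx, ], y[train_idx], kernel = "linear", cost = 1)
nb_fit  <- naiveBayes(as.data.frame(features[train_idx, ]), y[train_idx])

svm_pred <- predict(svm_fit, features[-train_idx, ])
table(predicted = svm_pred, actual = y[-train_idx])  # confusion matrix
mean(svm_pred == y[-train_idx])                      # held-out accuracy
```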
We report the SVM model results both for their greater accuracy and due to our theoretical prior that the model would be more suitable for a high-dimensional classification problem. For the first research question, our SVM model achieved 92.35% accuracy for statistical papers. For classifying experiments, the accuracy was 86.05%. In appendix A, we report the confusion matrices, hyperparameter tuning data and NB models.
The application of the SVM model to the full dataset of 93 931 publications leads to an estimate of 24 026 papers using statistical inference.
The identification of experimental papers proceeded slightly differently. Rather than beginning with the full corpus, we first filtered for only the papers that included the word 'experiment' over five times (4835). We then ran the SVM classifier on this subset. The resulting estimate was 2552 papers reporting experiments.
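A sketch of that filtering step, assuming a named character vector `texts` holding the plaintext of each paper (the name and the use of stringr are placeholders, not taken from the study's code), might look as follows.

```r
library(stringr)

# Count occurrences of 'experiment' (which also matches 'experiments',
# 'experimental', etc.) in each paper and keep those mentioning it more than five times.
n_mentions <- str_count(str_to_lower(texts), fixed("experiment"))
candidates <- names(texts)[n_mentions > 5]

# The trained SVM is then applied only to the features of these candidate papers, e.g.:
# exp_pred <- predict(svm_experiment_fit, features[candidates, ])
```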

Phase 2: identifying open data and preregistrations
We attempted to identify open data resources in seven ways:

- Using the Harvard Dataverse API, we downloaded all datasets held by every journal in our corpus that maintained its own, named Dataverse (n = 20).
- We queried the Dataverse for the titles of each of the 109 553 papers in our corpus and linked them to their most probable match with the aid of a custom fuzzy string-matching algorithm (a minimal sketch of this step appears below). We validated these matches and manually established a string-similarity cut-off, setting aside the remainder.
- We extracted from the full text of each paper in our corpus the link to its dataset on the Dataverse (1142; note this had significant overlap with the results of the first and second queries).
- We downloaded the metadata listing the contents of these datasets, to confirm first that they contained data and second that they did not consist of only pdf or doc files. In cases where a list of metadata was not available via the Dataverse API, we scraped the html of the dataset entry and searched for text confirming the presence of data files.
- We used regular expressions to extract from the full-text papers references to 'replication data', 'replication materials', 'supplementary files' and similar terms, then searched the surrounding text for any corresponding URLs or mentions of authors' personal websites or other repositories. We validated these results by exporting various combinations of string matches with the above terms to Excel files, where we examined them in tabular format and validated their relevance. Given that replication and supplementary material stored on personal websites are not of the same order as material on the Dataverse and similar repositories, these results are recorded in appendix A under the rubric of 'precarious data' (see ./code/R_original/supplementary_replication_mentions.R and ./code/R_updated/4.4_precarious-data.R).
- We searched all of the full-text papers for references to other repositories, including Figshare, Dryad and Code Ocean, using regular expressions, where the surrounding text contained a URL fragment. (Because we identified only five such articles that were not already coded as having open data, we did not include them in the results; see ./code/R_original/test_precarious_data.R.)
- As additional validation for DA-RT signatory journals specifically, we downloaded the html file corresponding to each article and/or the html file hosting supplemental material (n = 2284), then extracted all code- and data-related file extensions to establish their open data status (see ./code/R_original/validate_dart_analysis.R).

We attempted to identify the preregistration of experiments in the following ways:

- We used regular expressions to extract from all of the experimental papers sentences that referred to 'prereg' or 'pre-reg', 'preanalysis' or 'pre-analysis', as well as any references to commonly used preregistration servers (OSF, EGAP and AsPredicted), and then searched for the availability of the corresponding link to validate that the preregistration had taken place. Parts of this process, for instance searching author names in the Experiments in Governance and Politics (EGAP) registry to look for the corresponding paper, involved time-consuming detective work (see ./code/R_updated/5.2_identify-prereg.R, ./code/R_original/count_experimental_papers.R and ./code/R_original/count_pre-reg_papers.R).
- We downloaded all EGAP preregistration metadata in JSON format from the Open Science Framework Registry (https://osf.io/registries/), extracted from this file all osf.io links and unique EGAP registry IDs, and used command line utilities to search for them through the corpus of all the papers (see ./code/bash/rg_for_prereg_osf_egap_papers.sh).

We did not examine whether the published report conformed to the preregistration plan.
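The fuzzy title-matching step referenced in the second item above can be sketched with the stringdist package. The normalization, the Jaro-Winkler metric and the 0.9 similarity cut-off are assumptions made for illustration; the study reports using a custom algorithm with a manually validated cut-off, and `paper_titles` and `dataset_titles` are placeholder names for the Crossref and Dataverse title vectors.

```r
library(stringdist)

# Light normalization before comparing titles
norm <- function(x) tolower(trimws(gsub("[[:punct:]]+", " ", x)))

d   <- stringdistmatrix(norm(paper_titles), norm(dataset_titles), method = "jw")
sim <- 1 - d                                    # Jaro-Winkler distance lies in [0, 1]

best     <- apply(sim, 1, which.max)            # most similar dataset for each paper
best_sim <- sim[cbind(seq_along(best), best)]

matches <- data.frame(paper      = paper_titles,
                      dataset    = dataset_titles[best],
                      similarity = best_sim)
matches[matches$similarity >= 0.9, ]            # candidates kept for manual validation
```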

Results
Statistical inference papers are infrequently accompanied by the datasets or code that generated their findings. For the 12-year period under observation, we were able to match 21% of statistical inference articles to data repositories (overwhelmingly the Harvard Dataverse). Encouragingly, figure 1 shows that the percentage of open data increased between 2010 and 2021, rising steadily from about 11% to 26% over this period. The total number of statistical inference papers has also grown gradually over the 12-year period: we found 1329 such papers in 2010 and 2640 in 2020, the last year with complete data. This supports King's [53] observation that political science and international relations have long been disciplines increasingly concerned with quantitative methods; King illustrated that by 1988 almost half of publications in the American Political Science Review were quantitative. While the percentage of papers with open data has increased, so too has the absolute number of statistical papers without it. There are simply more published papers making inferences based on hidden data.
There are significant differences in open data practices between journals. Figure 2 displays the percentage of statistical inference papers with open data in the 41 journals with over 200 such papers (this cut-off focuses on journals that publish more quantitative papers and keeps the figure legible; the graph with all 158 journals with at least one statistical inference paper is located in appendix A). The number above each journal's bin represents the number of statistical inference papers detected by the SVM classifier. Of the 41 journals displayed, 11 have over 50% open data and 16 have over 20%. The distribution of open data by journal in figure 2 shows the stark divide between the quarter of journals that have high open data rates and the three quarters which do not. In 2020, however, improvements had occurred: of the 52 journals with over 20 statistical inference papers, 16 had over 50% open data, and a further 5 had over 20%.

The effectiveness of the DA-RT statement on journal open data practices is illustrated in figure 3, which displays the percentage of statistical inference papers with open data by year in each of the 16 DA-RT signatory journals we consider. Four journals, American Journal of Political Science, International Interactions, Political Analysis and Political Science Research and Methods, had already made significant progress prior to the release of the initial DA-RT guidelines in 2014. Many of the remaining journals made significant progress in 2016 or shortly thereafter. One caveat is that 2 of the 16 journal signatories have consistently low levels of open data even after DA-RT reforms were agreed to commence on 15 January 2016. The extent of transparent practices in three other journals, Journal of European Public Policy, European Journal for Political Research and Conflict Management and Peace Science, was more difficult to determine, given they did not use the Harvard Dataverse. Our attempt to estimate data and code availability for such journals, noted in point 7 of phase 2 of §2, appears to produce unreliable and puzzling results.

The preregistration of experiments is rare in political science and international relations journals. Figure 4 shows that the first preregistered study in the dataset that we could identify was in 2013, and that the rate of preregistration only began climbing in 2016. The proportion of experiments that were preregistered for the entire period is approximately 5%; the annual rate has slowly risen to 16% in 2021.
Figure 5 shows the percentage of experiments that were preregistered in the 29 journals with more than 20 experiments. Only the American Political Science Review exceeds 20%. Unlike with open data, when it comes to preregistration the differences between journals are small. Of the experiments published in Political Psychology and Political Behavior, the two journals with the most experiments that bridge the gap between political science and psychology, only 4% and 5%, respectively, are preregistered.
Prior to the onset of the replication crisis, which began with psychology in the 2010s, there were no organized attempts at enforcing preregistration or using registered reports as a way of curbing researcher flexibility and its attendant questionable research practices (QRPs). As psychology was among the first of the sciences to reckon with its methodological issues, brought to light in part by articles such as Simmons et al.'s [2], it is unsurprising that it took several years for these new practices to be adopted in contiguous disciplines like political science and international relations. However, our data illustrate that significant improvements must be made for experiments in these fields to meet current methodological best practices.

Discussion
Scientists must carry out their work while simultaneously signalling and vouching for its credibility.
For the pioneers of the scientific method in seventeenth-century Europe, this involved an ensemble of rhetorical and social practices, including the enlistment of trusted witnesses to testify that experiments had in fact taken place as claimed; this is what Shapin refers to as the moral economy of science [54,55].
In the digital age, we argue that the credibility of social science must largely rest on computational reproducibility, and the same goes for preregistration and experiments. The chief reason for depositing code and data is not signalling: open science practices provide the reader with an opportunity to transparently evaluate the evidence for a set of claims and to scrutinize an article for any of the myriad problems that plague the use of data and statistical models. An interested reader could investigate an article's data and code for errors, determine whether results are robust to different model specifications or, in rare cases, detect data falsification. For experiments, the published paper can be compared with the preregistration document to determine whether there were any unjustified deviations.
Our findings show that political science and international relations are not currently living up to these best practices. Of the approximately 25 000 statistical inference papers in the dataset, we could identify a corresponding data repository for only about 21%. Despite improvement in most years, change has not been uniform across the discipline: most of the progress has been made by a handful of the highest impact factor journals. In 2020, for example, 16 of the 52 journals with over 20 statistical inference papers had an open data percentage over 50% (figure 11), and 20 journals had an open data percentage over 20%. We could not locate data or code for 2 of the 16 DA-RT signatories in our dataset.
Universal open data is a collective action problem, and it is the responsibility of journals to foster and enforce these disciplinary norms. In the absence of that, individual researchers do not always share data, and requesting it can sometimes be mistaken for a gesture of challenge rather than collegiality. As Simonsohn [56] notes, the modal response to his requests for original data was that the data were no longer available. We suspect that variation in open data practices between journals reflects differences in journal editors' views of its importance for research credibility.

The DA-RT initiative sparked spirited debate in the field about the provision of data and code, but the same cannot be said for preregistration. Experiments are rarely preregistered: of the roughly 2552 experiments in our dataset, 5% are preregistered. Given that the use of experiments only began to take off after 2015, as shown in figure 4, the proportion of preregistered experiments in the literature is understandably low. Fortunately, the trend is positive. One journal of the 26 with more than five published experiments had a preregistration percentage of over 30% in 2020. While we do not take a position on whether all experiments should be preregistered, the percentages we identify should stimulate discussion of what the optimal percentage should be.
Identifying whether an experiment had a corresponding preregistration report was at times difficult. Numerous experiments made no mention of their preregistration report in the manuscript despite having one listed in a repository, and locating it was made harder by changing manuscript titles and authors. The omission is probably owing to the fact that many journal editors neither determine whether an experiment has a preregistration or pre-analysis plan nor request its disclosure. Still, the difficulty of matching an experiment with its stated preregistration report is far smaller than that of matching a manuscript to a concealed one. A unique and unanticipated problem we found was authors publishing a study while omitting any reference to a preregistered experiment, ostensibly owing to null findings. Byun et al. [57] use their survey data to make descriptive claims while failing to discuss the design or results of their experimental manipulation [58]. It is not clear whether their results failed to further their own argument or were possibly disconfirmatory. In either situation, readers cannot transparently evaluate the strength of their claims.
Peer reviewers and readers of published works routinely examine whether a theory or explanation has appropriate evidence, whether the measurements are valid and reliable, and whether the model has been appropriately specified. Here, we prompt referees and readers to also begin asking: (i) Are the computational reproducibility materials on the Harvard Dataverse or some other reliable repository? (ii) Is the paper computationally reproducible based on those materials? (iii) If an experiment, was it preregistered? (iv) Does the analysis in the experimental paper follow the preregistration plan, and are deviations from that plan justified? We hope that evaluating scientific research in this manner will help move readers away from trusting research in the absence of open science practices towards a more informed trust in their presence.

Appendix A

Identifying papers relying on data analysis
We defined data analysis papers as those that made any display or presentation of numerical data, most commonly in tables and graphs. Maps that included data-rich overlays and required software to produce were included in this category.

Figure 1. Open data in statistical inference papers by year.

Figure 2. Open data in statistical inference papers by journal (with over 200 papers). The number of detected papers by journal is above each bar.

Figure 3. Open data in statistical inference papers by year in each of the 16 DA-RT signatory journals.

Figure 7. SVM confusion matrix for experiment paper classification.

Figure 10. Open data by journal (with over 20 statistical inference papers) in 2020.

Figure 6. SVM confusion matrix for statistical inference paper classification.

Table 1. Journals analysed as ranked by Journal Citation Report 2020.