Exploring a formal approach to selecting studies for replication: A feasibility study in social neuroscience

Replication of published results is crucial for ensuring the robustness and self-correction of research, yet replications are scarce in many fields. Replicating researchers will therefore often have to decide which of several relevant candidates to target for replication. Formal strategies for efficient study selection have been proposed, but none have been explored for practical feasibility - a prerequisite for validation. Here we move one step closer to efficient replication study selection by exploring the feasibility of a particular selection strategy that estimates replication value as a function of citation impact and sample size (Isager, van 't Veer, & Lakens, 2021). We tested our strategy on a sample of fMRI studies in social neuroscience. We first report our efforts to generate a representative candidate set of replication targets. We then explore the feasibility and reliability of estimating replication value for the targets in our set, resulting in a dataset of 1358 studies ranked on their value of prioritising them for replication. In addition, we carefully examine possible measures, test auxiliary assumptions, and identify boundary conditions of measuring value and uncertainty. We end our report by discussing how future validation studies might be designed. Our study demonstrates the importance of investigating how to implement study selection strategies in practice. Our sample and study design can be extended to explore the feasibility of other formal study selection strategies that have been proposed.


Introduction
Close replication of original research results is essential to increase our confidence that empirical findings are reliable (LeBel, McCarthy, Earp, Elson, & Vanpaemel, 2018; Schmidt, 2009). When a study procedure is repeated in a new sample with high similarity to the original (preferably by a novel research team), spurious data patterns can be detected and discarded. Close replication reduces research waste by preventing researchers from building on seemingly promising findings that are not true. Various formal strategies for finding the studies that need replication the most have been developed in recent years (Field, Hoekstra, Bringmann, & Van Ravenzwaaij, 2019; Isager et al., 2023; Matiasz et al., 2018). If effective, such strategies have great potential to increase the transparency and efficiency of replication study selection. When criteria for study selection are made transparent, it becomes easier to discuss which replication studies are most important to fund, conduct, and publish. Additionally, formal strategies allow researchers who agree on criteria to more easily identify and coordinate replication of high-value studies in their field. By increasing the efficiency of coordination and resource spending in replication research, formal study selection strategies present a major step forward towards the important goal of making replication part of mainstream research practice. This is especially important in human behavioral neuroscience. Research in these fields is often vulnerable to inflated false positive rates and overestimation of effect sizes due to a combination of (1) low statistical power (Szucs & Ioannidis, 2017), (2) substantial researcher degrees of freedom that inflate the Type 1 error rate (Botvinik-Nezer et al., 2020; Carp, 2012), and (3) incentives to publish statistically significant results (Button et al., 2013). Unsurprisingly, rates of successful replications are low (Boekel et al., 2015). In spite of this, close replications of original studies are not common practice (Ashar et al., 2021; Huber, Potter, & Huszar, 2019; Poldrack et al., 2017). While the evidence cited deals primarily with cognitive neuroscience, we believe these issues generalize to most areas of neuroscience that utilize imaging techniques to study the neural correlates of human behavior. At the same time, the cost of data collection in such research is high (Poldrack et al., 2017). This leads to a conundrum. On the one hand, high data collection costs make it all the more important to conduct close replications and prevent costly studies from being built on spurious findings. On the other hand, high costs limit how often replication studies can be conducted. With limited resources and many non-replicated studies to choose from, researchers in social and cognitive neuroscience should consider which studies in the published literature would be the most important to replicate, so that resources directed towards replication can be spent optimally.
However, no formal study selection strategies have been tested for application in human behavioral neuroscience. To be applicable, a strategy must meet two basic conditions. First, it must be feasible to apply the strategy in practice. That is, the information needed to execute the strategy must be possible to obtain given reasonable time and resource constraints. Most formal study selection strategies are based on a combination of statistical, bibliometric, and substantive information about the candidate studies, which is often not easy to access (e.g., Federer et al., 2018; Furukawa, Barbui, Cipriani, Brambilla, & Watanabe, 2006; Glasziou, Meats, Heneghan, & Shepperd, 2008; Sullivan & Feinn, 2012; Tay, Kramer, & Waltman, 2020). The feasibility of existing strategies for application in any particular area of research is therefore uncertain. Second, provided that the strategy is feasible to apply, we must validate that the strategy actually helps us reach prespecified research goals. All feasible selection strategies lead to some prioritization of studies, but whether this prioritization has any validity and practical utility is an empirical question.
In this article we explore how feasible it is to apply a particular replication study selection strategy to fMRI research in social neuroscience. Because this work is exploratory, no part of the study procedures or analysis plan was preregistered prior to the study being conducted. We focus on a strategy previously developed by the first, second, and last author (Isager, van 't Veer, & Lakens, 2021). The main advantage of this strategy over potential alternatives is that it is (in theory) easy to apply, because it only requires information about the sample size and article citation count of each study that is considered for replication. It should therefore also be possible to apply the strategy to large bodies of research, such as all fMRI studies in social neuroscience. However, the strategy has never been utilized in practice, leaving many questions of practical application open. The goal of this article is therefore to apply the strategy proposed by Isager et al. (2021) in practice, to explore important implementation questions, and to identify real-world challenges and limitations that are so often overlooked in theoretical analyses.
We focus on fMRI research in social neuroscience because replication studies in this field are both scarce and costly, and because, of all methods and areas within the neurosciences, this is the one the first and last author are most familiar with. In other words, the field of social fMRI was chosen because we believed it would provide a sensible test context for the study selection strategy. It is not our aim to study the relative need for replication studies in social neuroscience versus other areas of neuroscience. The goal is simply to provide a case study for testing the feasibility of our selection strategy within the realm of human behavioral neuroscience. We reflect on the generalizability of our conclusions to other research areas in the general discussion.

Cortex 171 (2024) 330-346

2. A four step approach to select studies for replication
The concept of replication value is defined in the formal decision model for replication study selection proposed by Isager et al. (2023), which was developed to select which empirical claims in the scientific literature to replicate. According to this model, the goal of a replication effort is to maximize the expected utility of knowledge gained. Expected utility gain can be approximated by the replication value of the target claim we want to replicate. In this model replication value is a function of the value (or importance) of having accurate knowledge about the target claim, and our uncertainty about the truth status of the claim based on available evidence prior to replicating. Research claims that are highly valuable or important, and about which we are highly uncertain, will have a high replication value, and should be prioritized for replication in order to maximize expected utility gain. Isager et al. (2021) propose a quantitative method for estimating replication value in which value is operationalized as the average yearly citation impact of the article in which a claim is reported, and in which uncertainty is operationalized as the sample size used to investigate the claim. Replication value is then operationalized as the indicator RV_Cn, where RV_Cn denotes a particular operationalization of replication value, C stands for citation impact, n stands for the total number of participants included in the study, w(C_S) stands for the weighting function that should be applied to the citation impact (such as removing self-citations or not), S denotes the source the citation data is retrieved from, and Y stands for the age of the article in years. The equation assumes that average yearly citation impact is causally influenced by scientific impact, and that scientific impact partly determines the value of a claim. It also assumes that sample size partly determines the standard error of estimates in a study, which in turn partly determines the uncertainty about claims studied. Although both the average citation per year and the sample size are imperfect measures of value and uncertainty, our auxiliary assumption is that they are sufficiently correlated with value and uncertainty to generate a useful initial rank order of replication value.
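Written out from the symbol definitions above, the indicator can be sketched as follows. The exact functional form is an assumption here: adding 1 to the article age (so that an article published in the current year does not produce a division by zero) and taking the square root of n (mirroring how sample size enters a standard error) follow the description in Isager et al. (2021) as we read it, and are not confirmed by the surrounding text.

```latex
RV_{Cn} = \frac{w(C_S)}{Y + 1} \times \frac{1}{\sqrt{n}}
```

Under this form, the first factor is the (weighted) average yearly citation impact and the second factor shrinks as the sample size grows, so highly cited studies with small samples receive the highest values.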
RV_Cn is embedded in a four-step procedure for replication study selection (see Fig. 1). In the first step a set of candidate studies is identified based on the research interests and resource constraints of the replicating researcher. As with every systematic review of the literature, the scope needs to be broad enough to encompass all claims of interest to the researchers, but narrow enough so that the review process remains feasible. In the second step RV_Cn is calculated for each study included in the set to create an initial estimate of rank-order expected utility gain. In the third step a subset of the studies with the highest RV_Cn is inspected in-depth by reading the article. This step functions as an additional check of the RV_Cn estimates, and has as the primary goal to evaluate additional factors relevant to replication value (e.g., Field et al., 2019; Heirene, 2021; KNAW, 2018). In this step researchers can also evaluate the feasibility of a replication study given the resources they have available, and the extent to which a replication study will be able to reduce uncertainty about the effect (Isager et al., 2023). Finally, in this step researchers can check if the article is cited for its empirical claim, and remove replication candidates if the article is cited for other reasons (e.g., the use of a new method, or proposing a new theoretical idea). In the fourth step the candidate deemed most worthwhile to replicate is selected. Alternatively, if the researcher thinks the subset of studies that has been inspected contains no candidate that is worth replicating or feasible to replicate, steps 3 and 4 can be repeated for a second subset of studies. We recommend that researchers register their literature search (e.g., using PROSPERO), as well as the replication value formula they will use, and specify as well as they are able any selection criteria for the manual screening phase. This should prevent concerns about the opportunistic use of inclusion criteria to end up with a desired set of studies with a high replication value.
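Steps 2 and 3 of the procedure (score every candidate, then shortlist the top of the ranking for manual inspection) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Candidate` fields, the study labels, and the numbers are hypothetical, and the scoring function assumes the form w(C_S) / (Y + 1) × 1 / sqrt(n), which is our reading of Isager et al. (2021).

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Candidate:
    label: str        # hypothetical identifier for a candidate study
    citations: float  # weighted citation count, w(C_S)
    age_years: float  # Y, age of the article in years
    n: int            # total number of participants


def replication_value(c: Candidate) -> float:
    # Assumed form: average yearly citation impact times 1 / sqrt(n).
    return c.citations / (c.age_years + 1) / sqrt(c.n)


def top_candidates(candidates, k):
    # Step 2: rank by replication value; step 3 inspects the first k by hand.
    return sorted(candidates, key=replication_value, reverse=True)[:k]


# Hypothetical candidate set.
candidates = [
    Candidate("study_a", citations=200, age_years=9, n=16),
    Candidate("study_b", citations=200, age_years=9, n=100),
    Candidate("study_c", citations=45, age_years=2, n=25),
]
shortlist = top_candidates(candidates, k=2)
print([c.label for c in shortlist])
```

The ranking only provides the order in which articles are read; the in-depth checks described for step 3 still decide which candidate is selected.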

3. The current study - exploring the feasibility of using RV_Cn for study selection in social fMRI research
RV_Cn was developed to enable more efficient coordination of replication efforts. However, it is not clear whether it is feasible to use RV_Cn in practice for study selection in a research area such as social fMRI research. Our exploration focuses on the first two steps of the four-step procedure listed in Fig. 1. We report the results of our attempt to implement these steps in practice, including our method for collecting a sample set of replication candidates (step 1), and more importantly, our method for collecting the citation impact and sample size data necessary to calculate RV_Cn, the reliability of our methods for generating accurate measures of citation counts and sample sizes, and the distribution of RV_Cn for our set of candidates (step 2). In supplementary materials we also summarize our unsuccessful pilot efforts to collect additional quantitative information related to the main finding for each candidate study in our set. Although a main finding could often be identified based on the abstract, it proved too difficult to identify which statistical test was the basis of this main finding. Finally, we also provide a brief qualitative evaluation of the recommendations produced by RV_Cn to better understand what sort of studies are being recommended, what the boundary conditions of this study selection strategy are, and to understand the factors one might want to evaluate when implementing steps 3 and 4. We report how we determined our sample size, all data exclusions (if any), all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, all manipulations (there were none), and all measures in the study. We conclude the article by generating hypotheses for studies that could be undertaken to test the validity of RV_Cn.

3.1. Step 1 - determining an initial set of candidate studies

Eligibility criteria
To test the feasibility of calculating RV_Cn we first set out to determine a suitable set of candidate articles. This step is similar to any systematic literature review (e.g., a meta-analysis). We restricted our search for studies to fMRI research within social neuroscience published between 2009 and 2019, the most recent years at the time this decision was made. Although there is no need to restrict study selection to a specific time period, we reasoned that researchers might be especially interested in conducting replications of studies within a relatively recent time window to prevent unproductive follow-up research (when the original research is non-replicable) or stimulate follow-up research (when the original research is replicable).

Search strategy
We used the Web of Science (WoS; www.webofknowledge.com) database to construct our candidate dataset. WoS does not have a predefined field category for social neuroscience. To identify articles related to social neuroscience, we implemented a two-pronged search strategy on 2019-02-21. We first identified four journals in the WoS database as social neuroscience journals (Social Cognitive and Affective Neuroscience; Social Neuroscience; Behavioral Neuroscience; and Socioaffective Neuroscience Psychology). Empirical articles published in these journals were identified by submitting the following search term to WoS: (SO=(social neuroscience OR social cognitive and affective neuroscience OR behavioral neuroscience OR socioaffective neuroscience psychology) AND PY=(2019 OR 2009 OR 2018 OR 2017 OR 2016 OR 2015 OR 2014 OR 2013 OR 2012 OR 2011 OR 2010)) AND DOCUMENT TYPES: (Article) Timespan: 2009-2019. Indexes: SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, ESCI.
To identify social neuroscience articles in general topic journals we searched the entire WoS database for articles containing the keywords "social" and "fMRI" in all fields. Empirical articles containing the relevant keyword information were identified by submitting the following search term to WoS: ALL FIELDS: (fmri AND social) Refined by: DOCUMENT TYPES: (ARTICLE) Timespan: 2009-2019. Indexes: SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH, ESCI.

Selection process
The two search strategies yielded overlapping results. After removing duplicate records, the two search strategies yielded 7413 unique empirical articles in total (see Fig. 2). Basic bibliometric information about each article, including author-provided keywords, was downloaded for all articles.
Authors PMI and AvtV reviewed the initial set of articles and excluded articles they did not believe would be feasible to replicate given their expertise and available resources. This meant excluding animal model research, highly invasive study designs, imaging methods outside our area of expertise, research on patient groups, and articles with other keywords signaling that the study would require highly specific samples, procedures, or technologies to perform. At this stage, exclusion criteria were not predetermined, but were derived in an exploratory manner by inspecting keyword information in our initial candidate set. Note that future replicators who want to apply this step of selecting a candidate set can do so in a number of different ways depending on their specific interest or expertise. A written record of our decision rationale for each excluded keyword is openly available on OSF (https://osf.io/mtx72/). Our final set of candidates contained 2268 empirical articles.
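The two mechanical parts of this selection process (merging the two search exports without duplicates, then dropping records that carry an excluded keyword) can be sketched as below. This is a minimal sketch under stated assumptions: the record layout, the DOI-or-title matching rule, the example DOIs, and the excluded keyword are all hypothetical and do not reproduce the authors' actual procedure or their exclusion list on OSF.

```python
def merge_unique(*record_lists):
    """Merge search results, keeping the first record seen for each key."""
    seen = {}
    for records in record_lists:
        for rec in records:
            # Assumption: match on DOI when present, otherwise on title.
            key = (rec.get("doi") or rec["title"]).strip().lower()
            seen.setdefault(key, rec)
    return list(seen.values())


def exclude_by_keyword(records, excluded):
    """Drop every record that has at least one excluded keyword."""
    excluded = {k.lower() for k in excluded}
    kept = []
    for rec in records:
        keywords = {k.lower() for k in rec.get("keywords", [])}
        if not keywords & excluded:
            kept.append(rec)
    return kept


# Hypothetical exports from the journal-based and keyword-based searches.
journal_hits = [
    {"doi": "10.0000/example.1", "title": "Study one", "keywords": ["fMRI", "trust"]},
    {"doi": "10.0000/example.2", "title": "Study two", "keywords": ["fMRI", "rat"]},
]
keyword_hits = [
    {"doi": "10.0000/EXAMPLE.1", "title": "Study one", "keywords": ["fMRI", "trust"]},
    {"doi": None, "title": "Study three", "keywords": ["fMRI", "empathy"]},
]

unique = merge_unique(journal_hits, keyword_hits)
candidates = exclude_by_keyword(unique, excluded={"rat"})
print(len(unique), [rec["title"] for rec in candidates])
```

Keeping the exclusion list as explicit data, rather than as ad hoc filtering, makes it straightforward to publish the list alongside the rationale for each keyword.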

Exploration of sample representativeness
Once the final set of candidate records was determined, we explored the available bibliographic information to ensure that the sample indeed represented studies using fMRI in social neuroscience. The full dataset, including all bibliometric variables and a variable codebook, is available on OSF (https://osf.io/f7zdq/). The articles included in our dataset were published in 329 unique journals, consistent with our expectation that social neuroscience is a broad and loosely connected discipline of researchers from many subfields, who publish in a variety of specialty- and general-topic journals. Table 1 displays the name and frequency of the 20 journals most frequently published in (70.99% of all articles in the set were published in these 20 journals).
We used the statistical visualization software VOSviewer (van Eck & Waltman, 2010) to extract commonly mentioned terms from the titles and abstracts of all studies. Additional analyses of keywords retrieved from the Centre for Science and Technology Studies (CWTS, https://www.cwts.nl/) are reported in supplementary material SM1. All data included in the initial candidate set were subjected to analysis in VOSviewer (co-occurrence map with parameters set to binary counting, minimum number of occurrences set to 15, maximum number of keywords set to 200; age-related and generic terms were excluded). The list of excluded keywords and map files to recreate the reported co-occurrence map can be found on OSF: https://osf.io/f7zdq/. Fig. 3 displays the co-occurrence map between commonly mentioned keywords in our dataset.
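To make the two counting parameters concrete, the sketch below shows what binary counting and a minimum-occurrence threshold mean for a term co-occurrence table. It is an illustration in plain Python, not VOSviewer's implementation: term extraction, the cap on the number of keywords, and the clustering and layout steps are all omitted, and the example terms are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(doc_terms, min_occurrences):
    """Count term occurrences and pairwise co-occurrences across documents."""
    # Binary counting: a term counts once per document, however often it
    # is repeated within that document.
    doc_sets = [set(terms) for terms in doc_terms]
    occurrences = Counter(term for terms in doc_sets for term in terms)
    # Keep only terms that appear in enough documents.
    kept = {t for t, count in occurrences.items() if count >= min_occurrences}
    pairs = Counter()
    for terms in doc_sets:
        for a, b in combinations(sorted(terms & kept), 2):
            pairs[(a, b)] += 1
    return occurrences, pairs


# Hypothetical terms extracted from three titles/abstracts.
docs = [
    ["face", "emotion", "face"],
    ["face", "emotion"],
    ["reward", "face"],
]
occurrences, pairs = cooccurrence(docs, min_occurrences=2)
print(occurrences["face"], dict(pairs))
```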
The VOSviewer co-occurrence map corroborates that themes commonly studied in social neuroscience frequently co-occur in the titles and abstracts of articles in our data. Further, the analysis shows that individual topics could be organized into larger categories based on keyword co-occurrence clusters [represented as keyword colors in Fig. 3; van Eck and Waltman (2014)]. As expected from a set of articles sampled from social neuroscience, these categories center around themes such as face perception (purple cluster), judgment and decision-making (green cluster), language (red cluster), and social pain/ostracism/exclusion (blue cluster). The default mode network (yellow cluster) also has clear ties to social neuroscience research (Li, Mai, & Liu, 2014).
Converging lines of evidence suggest that our search strategy and selection process were successful in curating a dataset both representative of, and exclusive to, our target population of healthy human social fMRI research. Note that our sampling and selection process was largely constructed to overcome the problem that social fMRI is not a well-defined bibliometric category. Determining an initial set of candidates will likely be more straightforward when the field of interest aligns more closely with a well-defined bibliometric category (e.g., a WoS field category) or search terms related to more narrowly defined researcher interest and/or expertise.
We subsequently set out to quantitatively estimate the replication value for each study in this set (see Fig. 1, step 2). Following Isager et al. (2021) we chose RV_Cn as our operationalization of replication value (see equation under Section 2). To quantify the replication value, researchers need to specify what function w should be used to weigh the citations, which type of citation impact C is used, as well as the source S of that citation impact, if multiple sources are available. In the sections below we explain how citation impact and sample size data were collected in practice, and we explore the reliability of the collected data.

Operationalizing value as citation impact
To explore the impact of choosing one specification over another, we studied the reliability of citation impact estimates across a range of impact types C, sources S, and functions w. Although changes to these values will immediately impact the absolute replication value that is calculated, we are mainly interested in their impact on the relative ranking of studies in terms of replication value. Two qualitatively different types of citation impact C were collected: traditional academic citation indexes and Altmetric attention scores. Altmetric attention scores were collected using the rAltmetric package in R [Ram (2017); download date: 2020-10-30]. Altmetric attention scores are a weighted count of news- and social-media attention an article has received. For traditional citation impact, we collected data from multiple sources, including WoS (collected 2020-11-07 using the WoS web interface), Crossref [collected 2020-10-30 using the rCrossref package in R; Chamberlain, Zhu, Jahn, Boettiger, & Ram (2020)], Scopus [collected 2020-10-30 using the rScopus package in R;

Operationalizing uncertainty as sample size
Following the rationale of Isager et al. (2021) we operationalized the uncertainty about a claim before replication in terms of the standard error of effects supporting the claim. The standard error can be computed based on the standard deviation and the sample size, which is a combination of the number of participants and the number of observations per participant. We originally aimed to collect multiple sources of information that are relevant to quantifying the uncertainty, such as information about the statistical test and the test results (e.g., the standard deviation of the dependent variable), the experimental design (e.g., the number of trials), the number of existing replications, etc., as such information can be used to compute and evaluate alternative operationalizations of replication value. This information would allow us to compare estimates from the RV_Cn indicator with other proposed indicators of replication value (e.g., Field et al., 2019, which requires information about Bayes factors). We performed two pilot studies to 1) identify additional information that could be coded to quantify uncertainty, and 2) examine if this information could be efficiently coded (see supplementary materials SM2 and SM3, respectively). From these pilot studies we concluded that it was possible to identify the main claim in a paper (which was often feasible based on the abstract), but that it was not feasible to identify the results of the statistical test that provided empirical support for the main claim. It was often not possible to identify which of the many statistical tests the authors reported was the basis of the main claim. The main reason for this difficulty was that the verbally stated hypotheses were often too ambiguously connected to the reported statistical tests, making it difficult to identify which statistical results would corroborate or falsify the main claim of the paper. This problem is frequently experienced when coding claims and the corresponding tests from the scientific literature (e.g., Edelsbrunner & Thurn, 2020; Scheel, 2022). Furthermore, statistical results were often not reported in sufficient detail to extract information (e.g., about the standard deviation of the measure). We concluded that it would not be feasible to collect additional information related to the uncertainty of the claim on a large scale from the social fMRI literature. In the end, the number of participants was the only operationalization of uncertainty we were able to move forward with in this study. This is an approximation of uncertainty that ignores variation in standard deviations and in the number of trials in a study. In addition, we did not identify replication studies of the studies in our candidate set. It is reasonable to assume that some studies in our set have been replicated. Replication studies should normally reduce our uncertainty about a claim, and methods for incorporating replication information in the RV_Cn estimate have been developed (Isager et al., 2021, Supplementary material 1). However, because original and replication studies are not systematically connected in the bibliometric record, and because we believe replication studies to be quite rare within social fMRI research anyway, we elected not to code such information in this study.
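The link between participant count and uncertainty that motivates this operationalization can be illustrated with the textbook standard error of a mean. The sketch below is only that textbook relation, with hypothetical numbers; it deliberately ignores the number of trials per participant and differences in standard deviation between studies, which is exactly the simplification described above.

```python
from math import sqrt


def standard_error(sd, n):
    """Standard error of a mean, given a standard deviation and n participants."""
    return sd / sqrt(n)


# Hypothetical studies with the same standard deviation but different samples.
se_small = standard_error(sd=1.0, n=16)
se_large = standard_error(sd=1.0, n=64)
print(se_small, se_large)
```

Quadrupling the number of participants halves the standard error, which is why a ranking based on participant count alone can still carry information about relative uncertainty.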
3.2.3. Collecting and inspecting the reliability of RV_Cn input
3.2.3.1. RELIABILITY OF CITATION IMPACT ACROSS SOURCES. To better understand the relationship between different variables related to the citation impact C across sources S, we explored the strength of the association between a variety of citation metrics (Table 2). All metrics were retrieved within a time span of two weeks to prevent differences due to a time-lag. Fig. 4 displays the distributions of all citation metrics. All metrics are heavily right skewed. The distributions of raw citation counts are highly overlapping across sources (Fig. 4A). CWTS citation counts are more heavily skewed towards zero than raw counts from other sources, likely due to the fact that CWTS subtracts self-citations from the total citation count.
To examine how strongly WoS, Crossref, Scopus, CWTS, and scite™ were correlated measures of the same underlying construct - the raw academic citation impact of an article - we subjected the citation data from these sources to an intraclass correlation analysis [model = two-way fixed effects, type = single rater, definition = consistency; Koo and Li (2016)] using the ICC function in the R package psych [Revelle (2021); ICC3 output reported]. Because all citation metrics have a skewed distribution, and because we are primarily concerned with the rank-ordering of the studies we retrieved citation metrics for (Isager et al., 2021), Spearman's rho correlation was used to assess the strength of association.
Fig. 5 displays the rank-order correlations between various citation metrics. The correlation between raw citation counts from any two sources was very high (always > .94). The interrater reliability between these metrics was similarly high, ICC = .97, 95% CI [.96, .97]. When self-citations are subtracted, as is done in the CWTS citation counts, correlations are only slightly lower compared to intercorrelations between the other sources, suggesting that self-citations will not have a large impact on the computation of a replication value.
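The two statistics used here can be sketched in plain Python to show what they measure. This is an illustration under stated assumptions, not the analysis code (the reported ICC comes from the `ICC` function in the R package psych): Spearman's rho is computed as a Pearson correlation on average ranks, and the consistency, single-rater ICC uses the standard two-way mean-squares formula (MS_rows - MS_error) / (MS_rows + (k - 1) MS_error). The citation counts in the example are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1, with ties given the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def icc3_single(table):
    """Consistency ICC for a single rater; rows = articles, columns = sources."""
    n, k = len(table), len(table[0])
    grand = mean(v for row in table for v in row)
    row_means = [mean(row) for row in table]
    col_means = [mean(row[j] for row in table) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in table for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical counts: the second source reports 5 more citations per article.
counts = [[10, 15], [20, 25], [3, 8], [40, 45]]
rho = spearman([row[0] for row in counts], [row[1] for row in counts])
icc = icc3_single(counts)
print(rho, icc)
```

Because the consistency definition ignores constant offsets between sources, a source that systematically reports more citations than another can still yield an ICC close to 1, which matches the focus on rank order rather than absolute counts.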
As expected based on the prior literature (Costas, Zahedi, & Wouters, 2015), the correlations between Altmetric scores and all other metrics were consistently low. The correlation between normalized and non-normalized citation counts was consistently high across sources, though substantially lower than the inter-correlation between different raw citation counts. As will be discussed in more detail below, this suggests that it matters little for RV_Cn estimates which source S is used, but it does matter whether one chooses raw or field-normalized citation count as the operationalization of w(C), and it would matter substantially whether one chooses to use traditional citation count or news/social-media impact as the operationalization of C. The reliability of Altmetric attention scores as estimates of news/social-media impact remains unclear, as we had no other metrics for this kind of impact to compare against. We will examine the consequences of using Altmetric scores or field-normalized citation counts on the computation of replication value scores below.

3.2.3.2. ACCURACY AND UNBIASEDNESS OF AVERAGE YEARLY CITATION COUNT. The ideal citation metric of RV_Cn is the number of future citations an article will receive (Isager et al., 2021). Total citation count is not a useful estimator of future citation impact because citations accumulate over time. As an article gets older it will tend to get a higher total citation count. This could mean that a 50 year old article cited once per year has the same total citations as an article published last year that has been cited 50 times, even though we should expect the latter to have much more impact on the field in the future. To prevent age from impacting the replication value of articles, RV_Cn uses the average yearly citation count instead of the total count as an operational measure of value.
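The 50-year example above works out as follows. The sketch divides total citations by article age directly, to mirror the wording of the example; any offset that the indicator itself adds to the age term is left out here, and the function name is hypothetical.

```python
def average_yearly_citations(total_citations, age_years):
    """Total citation count divided by the age of the article in years."""
    return total_citations / age_years


# Both articles have 50 citations in total.
old_article = average_yearly_citations(total_citations=50, age_years=50)
new_article = average_yearly_citations(total_citations=50, age_years=1)
print(old_article, new_article)
```

The totals are identical, but the yearly averages differ by a factor of 50, which is the distinction the age correction is meant to capture.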
To examine how well average yearly citation count predicts future citation count we obtained the yearly citation rate for each year separately from scite™, including the citation counts for 2020. Then, with the average yearly citation count of each article from all years until 2019, we predicted the citation rate of each article in our data for the year 2020 (the last complete year in the data from scite™). To examine whether average yearly citation count is a sufficient approach to correct for the effect of age on citation counts we examined the correlation between age and average yearly citation count. In addition, we explored the relationship between age-averaged citation count and age/field-normalized CWTS citation counts, which are age-adjusted using the superior method of normalizing the citation count against all articles from the same year. If age-averaging is an effective method for age adjustment, age-averaged citation count should correlate more strongly with CWTS normalized scores than raw citation count. Finally, we also examined the effect of age-averaging on Altmetric attention scores. Our goal in examining the relationship between these variables is to gain a better understanding of which data should be used to quantify the value of a published study. We focus on scite™ citation count data in these analyses since it was the only source from which we could obtain data on yearly citation rate. However, the reported pattern of results is highly similar regardless of which citation source is used (see supplementary material SM4).

Table 2 - Frequency of various citation metrics available for our data. Web of Science citation counts were originally available for all articles, but some could not be retrieved when the citation count data was updated in 2020. [Table body not available; first column: Citation metric.]
3.2.3.2.1. PREDICTIVE ACCURACY. Fig. 6A displays the scite™ citation rate trajectory for all articles in our data. Fig. 6B displays the same trajectories on a log+1 scale with box plots summarizing the distribution for each year since publication, which gives a better sense of the overall trend. On average, most articles seem to be cited at an increasing rate for about the first two years since publication. Then the citation rate stabilizes, possibly increasing slightly around year ten. Given this general trend, our auxiliary assumption that average yearly citation count is on average a useful predictor of future citation impact is supported. Including citations from the first two years seems to lead to an underestimation of the citation rate in later years, but this might not directly affect any rank-order of citation counts.
Fig. 6C displays how accurately average yearly citation count (using data until 2019) predicts the "future" citation count in 2020. Predictive accuracy is quite good, but far from perfect, r = .75, CI 95% [.72, .77]. As noted above, average yearly citation count consistently underestimates how many citations are obtained in 2020. The first two years since publication are included in the average yearly citation count, which tends to drag down the average. Also as expected, underestimation of citations in 2020 seems to be particularly severe for more recently published articles (more yellow dots above the line). The younger the article, the more its average yearly citation count is influenced by the relatively lower number of yearly citations in the first two years since publication. Because total citation counts obtained from scite™ were highly correlated with total citation counts obtained from other sources, we believe the results reported here likely generalize to citations from WoS, Crossref, Scopus, and CWTS. The results suggest that the predictive accuracy of RV Cn could be improved by excluding citations from the first two years since publication. Alternatively, accuracy could be improved through more accurate modeling of each article's citation trend. Such improvements require data on citations per year, which is not easily accessible to most researchers [the information was provided to us by scite™ (www.scite.ai)].
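The prediction check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article labels and yearly counts are invented, the helper names (`average_yearly_citations`, `pearson_r`) are hypothetical, and the optional `skip_first` argument merely illustrates the suggestion of excluding the first two years; it is not the analysis code used for Fig. 6C.

```python
from math import sqrt


def average_yearly_citations(yearly_counts, skip_first=0):
    """Mean citations per year, optionally ignoring the first `skip_first` years.

    Falls back to all years when the article is too young to skip any.
    """
    kept = yearly_counts[skip_first:] or yearly_counts
    return sum(kept) / len(kept)


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented yearly citation counts (publication year up to 2019) and the
# count observed in 2020 for four hypothetical articles.
history = {
    "article_a": [1, 4, 9, 10, 11],
    "article_b": [0, 2, 3, 3],
    "article_c": [2, 6, 7],
    "article_d": [0, 1],
}
observed_2020 = {"article_a": 11, "article_b": 3, "article_c": 8, "article_d": 2}

predicted = [average_yearly_citations(history[k]) for k in history]
observed = [observed_2020[k] for k in history]

r = pearson_r(predicted, observed)
# A negative mean error means the average underestimates the 2020 count.
mean_error = sum(p - o for p, o in zip(predicted, observed)) / len(predicted)
```

In this toy data the early low-citation years pull every average below the 2020 count, which mirrors the underestimation described in the text while leaving the ordering of articles largely intact.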

PREDICTIVE UNBIASEDNESS. Article age was very weakly correlated with the number of scite™ citations an article received from January to December of the year 2020, r = .07, CI 95% [.02, .11], suggesting article age is not a determinant of future citation impact and can safely be corrected for. To examine how well age-averaging corrects citation estimates for age, we computed pairwise Spearman correlations between publication age, scite™ citation count, Altmetric scores, scite™ citation count divided by years since publication, Altmetric scores divided by years since publication, and CWTS normalized citation count. Fig. 7 displays the correlation coefficients between all variables of interest. Not surprisingly, there was a strong correlation between age and raw scite™ citation count, r = .54, CI 95% [.51, .57]. The correlation between citations and age dropped substantially when citation count was divided by years since publication. However, a small residual correlation between average yearly citation rate and publication age remains, r = .12, CI 95% [.07, .16]. This suggests that dividing total citation count by the number of years since publication is an imperfect age adjustment method. Because the correction substantially reduces the correlation between age and citation count, however, it is a substantially better measure than total citation count. Averaging over age works best if citations accumulate at a constant rate, but this rate is quite variable for most articles (Fig. 6A). Encouragingly, however, averaging citation count by age does increase the correlation between citation count and CWTS normalized scores, whose method of age correction is superior as it corrects for the average number of citations of all publications published in the same field and in the same year. Interestingly, even CWTS scores are weakly positively correlated with age, suggesting that perfectly adjusting for article age is challenging. In summary, taking the average yearly citation count seems to be an imperfect but efficient method for age adjustment in traditional citation metrics.
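To make the age-adjustment check concrete, the following Python sketch compares the rank correlation between article age and raw citation count with the correlation between age and citations per year. The data are invented, the helper names are hypothetical, and ages are assumed to be at least one year so that the division is defined; it illustrates the logic rather than reproducing the analysis behind Fig. 7.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            out[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return out


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Invented example: older articles have had more time to accumulate citations.
ages = [1, 2, 4, 6, 8, 10]                      # years since publication
total_citations = [5, 6, 20, 18, 48, 40]        # raw citation counts
per_year = [c / a for c, a in zip(total_citations, ages)]

rho_raw = spearman_rho(ages, total_citations)   # strong age effect
rho_adjusted = spearman_rho(ages, per_year)     # much weaker after averaging
```

As in the reported results, the adjusted correlation shrinks toward zero without necessarily reaching it, because citation rates are not constant over an article's lifetime.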

CODING NUMBER OF PARTICIPANTS. The number of participants for each study in our dataset was coded manually. Manually coding the number of participants for all studies in the full set of 2268 candidate articles was assumed to be costly and time consuming from the outset. In practice, we expect most researchers to have narrower inclusion criteria when computing the replication value for a set of replication targets. For feasibility reasons, we aimed at coding 1000 articles drawn at random from the full set of 2268 articles and began the process of splitting these into individual studies for coding the number of participants. While coding, it became clear that many studies did not meet our inclusion criteria. To ensure we would end up with at least 1000 articles, we oversampled with an additional 500 articles drawn at random from the full set. The exact code used to draw the sample is available on OSF (https://osf.io/rxukq/). After removing articles that matched our initial exclusion criteria (e.g., single non-fMRI studies from multi-study articles, such as De Vries, Fennis, Bijmolt, Ter Horst, & Marsman, 2018, study 4), the number of participants was coded for each fMRI study in the article.
Coding was performed by a team of three undergraduate research assistants. For each article we identified the number of studies reported in the article. For each study we recorded the number of participants who contributed any fMRI data to analyses reported in the study (even if their data were excluded from some analyses). For further details about how coders were instructed to proceed with coding the number of participants, see the supplementary coding instructions (https://osf.io/j3pxf/).
The 1500 articles contained 1681 individual studies, of which 323 matched our exclusion criteria. The final dataset contained 1358 individual studies from 1283 unique articles. Coding took a few minutes when the number of participants and exclusion criteria were clearly summarized in either the study abstract or the "participants" subsection of the methods section, but could take longer if reporting was less structured. To ensure that the number of participants was reliably coded, a subset of 250 studies, randomly selected from the larger set of 1358, was double-coded by independent coders and subjected to an inter-rater reliability analysis. Two additional coders (one additional undergraduate student, referred to as the undergraduate coder, and the first author, referred to as the PhD coder) re-coded the number of participants for each study in this subset. While coding, all coders were blind to the number of participants provided by other coders. To study inter-rater reliability, we calculated the percentage agreement between the coders, and we calculated the intraclass correlation coefficient between coders (model = one-way fixed effects, type = single rater, definition = absolute agreement) using the ICC function in the R package psych (ICC3 output reported). Overall, there was high but imperfect agreement between the three coders (percentage exact agreement = .77). The intraclass correlation coefficient between raters was high, ICC = .82, CI 95% [.79, .86]. Fig. 8 displays the variation in sample size between the coders, plotted on a log scale.
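The two reliability summaries can be illustrated with a short Python sketch. The ratings below are invented, and the functions are hypothetical helpers rather than the analysis code. Note in particular that the sketch computes the textbook one-way, single-rater ICC (often written ICC(1,1)) from ANOVA mean squares, which is an assumption made for illustration and need not equal the ICC3 value that the psych package returns.

```python
def exact_agreement(ratings):
    """Share of studies for which every coder recorded the same value."""
    return sum(len(set(row)) == 1 for row in ratings) / len(ratings)


def icc_oneway_single(ratings):
    """One-way, single-rater intraclass correlation, ICC(1,1).

    `ratings` is a list of rows, one row per study, one column per coder.
    """
    n = len(ratings)          # number of studies
    k = len(ratings[0])       # number of coders
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Invented sample sizes: (original coder, undergraduate coder, PhD coder).
ratings = [
    [20, 20, 20],
    [18, 18, 18],
    [35, 35, 33],   # one coder accounted for two excluded participants
    [24, 24, 24],
    [60, 60, 60],
]

agreement = exact_agreement(ratings)   # 4 of 5 rows agree exactly
icc = icc_oneway_single(ratings)
```

The example also shows why the two numbers can diverge: a single small discrepancy lowers exact agreement noticeably while barely affecting the intraclass correlation.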
Coders disagreed in 57 cases. All disagreements between coders were resolved by the PhD coder after inspecting comments by the other coders. In addition to the disagreements identified in the data used for the inter-rater reliability analysis, one additional sample size coding error in the full set of 1358 studies was detected and corrected later during the analyses. Fig. 9 displays the distribution of sample size in our data after resolving coder disagreements.

Calculating and comparing alternative operationalizations of RV Cn
Having established that sufficiently accurate citation counts and the number of participants can be collected, we proceeded with the calculation of RV Cn. Because replicating researchers may end up relying on any of several citation metrics to estimate value, we decided to compare the results of several alternative operationalizations of replication value: one indicator measured value via the WoS citation count of the articles (RV WoS), one via the Scopus citation count (RV Scopus), one via the field-normalized citation counts (RV tncs), one via the scite™ citation count (RV scite), and one via the Altmetric attention score of the articles (RV Alt). All indicators used sample size as a measure of uncertainty.
RV WoS was based on the equations derived by Isager et al. (2021), and calculated in the following way:

$$RV_{WoS} = \frac{C_{WoS}}{Y + 1} \times \frac{1}{\sqrt{n}}$$

where C WoS denotes the WoS citation count of the article a study is reported in, Y denotes the article age in years, and n denotes the sample size of the study after exclusion. The three measures using Scopus, scite™, and cluster-normalized citation scores were computed in the same way as RV WoS.
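As a worked illustration, the indicator can be computed with a small Python function. The sketch assumes the form given by Isager et al. (2021), in which the citation count is divided by article age plus one and then multiplied by one over the square root of the sample size; the function name, argument names, and example values are hypothetical.

```python
from math import sqrt


def replication_value(citations, age_years, n):
    """Citation-based replication value for a single study.

    Assumed form: (citations / (age_years + 1)) * (1 / sqrt(n)).
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if age_years < 0:
        raise ValueError("article age cannot be negative")
    return (citations / (age_years + 1)) * (1 / sqrt(n))


# Two hypothetical studies from equally cited articles of the same age:
# the study with the smaller sample receives the higher replication value.
rv_small_sample = replication_value(citations=100, age_years=9, n=25)
rv_large_sample = replication_value(citations=100, age_years=9, n=100)

# An uncited article yields a value of zero regardless of sample size.
rv_uncited = replication_value(citations=0, age_years=0, n=30)
```

The same function applies to any citation source (Scopus, scite™, or normalized scores) by changing only the `citations` input.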
RV Alt was calculated in the following way:

$$RV_{Alt} = C_{Alt} \times \frac{1}{\sqrt{n}}$$

where C Alt denotes the Altmetric attention score of the article, and n denotes the sample size of the study after exclusion. Because the analyses above revealed that Altmetric attention scores are not strongly correlated with article age in our data, we did not average C Alt over publication year in this replication value indicator. Many articles are not mentioned in any sources that are tracked by Altmetric, and therefore have a score of 0. In our dataset C Alt could only be calculated for 1156 of 1358 studies. Importantly, we calculated all replication values under the assumption that no study in our candidate set is a replication of another study in the set, implying that no studies should be combined in the estimate of n. Because the lack of replication research in fMRI research (Poldrack et al., 2017) implies that only very few articles in our dataset would be replications of one another, we found it acceptable to proceed with calculation under the assumption that there were no replications in the data. Where direct replication studies have been performed, it would have been more appropriate to combine the sample size from the original study and its replications (Isager et al., 2021, supplementary material 1). However, there are no databases that store information about direct replication in social neuroscience. Whenever researchers compute the replication value for a more specific population, information about direct replications might be more readily available, or it can be manually searched and coded in step 3.
The distribution of replication value from all indicators was visually inspected, and estimates from indicators were correlated to study their similarity. Spearman's rho was used since the rank-order correlation between different indicators is of primary interest. 95% bootstrap confidence intervals were calculated for the correlation estimate using the spearman.ci function of the RVAideMemoire package in R (Hervé, 2021).
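The idea behind a bootstrap confidence interval for Spearman's rho can be sketched in Python as follows. This is a plain percentile bootstrap written for illustration; it is an assumption that this matches the general approach, and it may differ in detail from what spearman.ci does. The indicator values and all function names are hypothetical.

```python
import random
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            out[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return out


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


def bootstrap_spearman_ci(x, y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for rho, resampling studies with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # rho is undefined when a resample is constant
        estimates.append(spearman_rho(xs, ys))
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Invented replication values for twelve studies under two citation sources.
rv_source_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
rv_source_b = [2, 1, 3, 5, 4, 6, 8, 7, 9, 12, 10, 11]

rho = spearman_rho(rv_source_a, rv_source_b)
ci_lower, ci_upper = bootstrap_spearman_ci(rv_source_a, rv_source_b)
```

Fixing the seed makes the interval reproducible, which is useful when the interval is reported alongside a ranked list that others may want to regenerate.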
Fig. 10 displays the distribution of RV WoS, RV Alt, RV Scopus, RV tncs (field-normalized citation scores), RV scite, and their associations with RV WoS. All distributions are highly skewed, with most scores distributed around low values, which is expected given that the number of participants, citation counts, and Altmetric attention scores are all highly skewed as well (see Figs. 4 and 9). Overall rank-order correlations were high for different citation sources (WoS, Scopus, scite), lower for field-normalized citation counts, and low for Altmetric scores (see Fig. 11). As a consequence, only two studies (Kassam, Markey, Cherkassky, Loewenstein, & Just, 2013; Tamir & Mitchell, 2012) were ranked among the top ten in both WoS and Altmetric rank-orderings (purple-colored points in Fig. 10). The same was true for field-normalized citation scores, where the overlap between top-ranked studies using WoS citation scores and field-normalized citation scores was very low (despite the relatively high correlations between the two measures). Traditional citation impact and altmetric attention scores are generally thought to measure different aspects of impact and are known to be weakly associated. It is clear that field-normalized citation scores also measure impact in a substantially different manner than raw citation counts. Using raw citation counts from different sources such as scite™ or Scopus does not lead to substantially different selections, although even there some variation should be expected in the last one or two studies included when selecting the X highest-ranked studies (e.g., the 9th and 10th study included in a top 10).
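The overlap between top-ranked studies under two indicators can be checked with a few lines of Python. The study identifiers and scores below are invented, `top_k` is a hypothetical helper, and a study without an Altmetric score is simply assumed to be absent from that indicator's dictionary.

```python
def top_k(scores, k):
    """IDs of the k highest-scoring studies (ties broken by ID for determinism)."""
    return set(sorted(scores, key=lambda study: (-scores[study], study))[:k])


# Invented replication values under two indicators. Study "s3" has no
# Altmetric score, so it cannot appear in the Altmetric ranking at all.
rv_wos = {"s1": 9.1, "s2": 7.4, "s3": 6.8, "s4": 2.0, "s5": 1.1, "s6": 0.4}
rv_alt = {"s1": 0.2, "s2": 8.8, "s4": 5.0, "s5": 6.3, "s6": 0.0}

top_wos = top_k(rv_wos, 3)
top_alt = top_k(rv_alt, 3)
shared = top_wos & top_alt          # studies recommended by both indicators
```

Even when two indicators correlate moderately across the whole set, the intersection of their top-ranked studies can be small, because the top of a skewed distribution is sensitive to modest changes in score.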
To conclude, quantitative recommendations for which studies to replicate will vary substantially based on whether traditional, field-normalized, or altmetric citation impact is used to estimate replication value, because these impact metrics measure non-overlapping aspects of scientific impact. Different stakeholders may prefer either operationalization, depending on what aspects of impact they find most relevant. Altmetric attention scores are only weakly correlated with traditional citation counts, which has a substantial impact on RV Cn estimates.

3.3. Step 3 - In-depth review of recommended candidates

The next step when selecting a replication target is an in-depth inspection of studies with a high replication value. For our exploratory purposes, we expected such an in-depth review to reveal certain boundary conditions under which the number of participants and/or the citation count do not accurately reflect the value and impact of a study. We subjected the 10 studies with the highest and the 10 studies with the lowest replication value on either RV WoS or RV Alt to an in-depth inspection. In addition, we included the 10 lowest non-zero estimates from the RV WoS distribution, because RV WoS scores of 0 often simply reflect a paper too young to have picked up citations yet. In total, 44 unique studies were included in our face validity review (6 studies were among the highest or lowest scores for both indicators).
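The construction of the review set can be sketched in Python as a union of rank-based selections. The scores are invented, `review_set` is a hypothetical helper, and a small k is used so that the example stays readable; with k = 10 the same logic yields at most 50 candidates before duplicates are removed.

```python
def review_set(rv_a, rv_b, k=10):
    """Union of the k highest and k lowest studies on each indicator,
    plus the k lowest non-zero studies on the first indicator."""
    selected = set()
    for scores in (rv_a, rv_b):
        ranked = sorted(scores, key=lambda study: (scores[study], study))
        selected |= set(ranked[:k])      # k lowest scores
        selected |= set(ranked[-k:])     # k highest scores
    nonzero = sorted(
        (study for study in rv_a if rv_a[study] > 0),
        key=lambda study: (rv_a[study], study),
    )
    selected |= set(nonzero[:k])         # k lowest non-zero scores on rv_a
    return selected


# Invented replication values for ten studies under two indicators.
rv_wos = {"s01": 0, "s02": 0, "s03": 0.2, "s04": 0.3, "s05": 1,
          "s06": 2, "s07": 3, "s08": 4, "s09": 8, "s10": 9}
rv_alt = {"s01": 5, "s02": 0, "s03": 1, "s04": 6, "s05": 0,
          "s06": 2, "s07": 3, "s08": 0.5, "s09": 20, "s10": 4}

# Five selections of two studies each give ten candidates; three of them
# are selected twice, so seven unique studies remain.
to_review = review_set(rv_wos, rv_alt, k=2)
```

Because the selections are combined as a set, studies that rank at an extreme on more than one criterion are counted once, which is why the number of unique studies is smaller than the sum of the individual selections.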
We wanted to see whether quantitative replication value estimates would conform to our own intuitions about replication value, and to identify factors that would lead to a high replication value under a formula-based approach without actually warranting a replication. Such boundary conditions are likely present in other sets of replication targets as well, and identifying such factors will help researchers during the in-depth inspection in step 3. For example, an article may be highly cited for reasons other than the empirical studies it reports, in which case the paper is highly cited while the study in the article is not worth replicating. As such, the goal is to identify potential issues with validity, reliability and measurement error that future validation studies of RV Cn may want to follow up on.
Authors PMI and AvtV read the title and abstracts of all studies included in the review, consulted the article text intermittently for clarifications, and reviewed quantitative information related to the replication value estimates of these studies (i.e., reviewers were not blinded to a record's rank position). Both reviewers first made notes for each study in private, focusing on their intuitive validity judgment of the replication value estimate and on potential sources of error and bias. Notes were then discussed by PMI, AvtV, and DL in two meetings to distill the most central outcomes of the review effort. The full set of notes is available on OSF for author PMI (https://osf.io/vwpqs/) and AvtV (https://osf.io/953rh/).

Central outcomes of the review process
The in-depth review yielded several insights. A detailed inspection of quantitative replication value estimates is important for quality control. In two studies, coders had erroneously coded an incorrect number of participants (due to a transcription error, and overlooking data exclusions). Eight articles turned out not to be connected to social neuroscience, and one study did not utilize fMRI for imaging. Finally, in one case we had incorrectly labeled a single two-session repeated measures study as two separate studies. Together, these studies make up one quarter of the entire sample selected for review. This clearly indicates that, in this particular context, RV Cn is a noisy measure of replication value, and finding the studies most in need of replication is highly dependent on the third step of the procedure.
There was not always an intuitive correspondence between the RV Cn rank order and our intuitions about the replication value of the claims based purely on the title and information in the abstract. One reason for this lack of correspondence may have been that reviewers were not blind to the replication value ranking, and had access to the citation count and number of participants, which were so salient that they were difficult not to take into account. Another reason was that without other explicit criteria to determine the value of a replication study, there was substantial subjectivity in the value of each study as judged by both reviewers. This is not unexpected, as peer evaluations of the value of a study are variable, and not strongly related to eventual citation scores (Gottfredson, 1978). A final reason for the low perceived correspondence between indicator estimates and reviewer intuitions was a number of boundary conditions where the RV Cn estimates did not accurately reflect the value and uncertainty of the studies.
The first boundary condition was that many studies used within-subject designs, where the number of participants does not fully capture the uncertainty, as it ignores the number of measurements per participant. The use of within-subject designs seemed to be common among the highest ranked studies, as such designs require fewer participants for high statistical power, and therefore receive a higher replication value when uncertainty is based only on the number of participants. This is clearly an important limitation, especially when the number of trials varies substantially between studies (as was the case in the set of studies we examined). In future applications of RV Cn-based study selection we therefore recommend that uncertainty is quantified during step 2 based on both the number of participants and the number of observations per participant. If this is unfeasible (which is likely given how unsystematically this information is reported in the literature), the number of observations should be taken into account during step 3 (see supplementary material 2 in Isager et al., 2021 for technical details on such a correction method). Alternatively, selecting a narrower set of candidates with homogeneous study designs in step 1 will alleviate this limitation. Another boundary condition concerned a study that had already been replicated in the literature. Although such cases are rare, when replication studies already exist, the replication value should be computed based on the uncertainty remaining after all replication studies (Isager et al., 2021).
Other boundary conditions concerned the reason why an article was highly cited. One article containing both a literature review and an empirical study seemed to be cited primarily due to the literature review (Dimoka, Pavlou, & Davis, 2011). Another study on human navigation appeared to receive a large Altmetric score primarily due to speculative news reports claiming that GPS use can "turn the brain off" - even though this conclusion did not follow from the study (Javadi et al., 2017). A replication of the study results would do little to avert such speculations, since the speculations are not grounded in the actual study results. The boundary conditions identified so far seem general enough to incorporate in the in-depth review process of replication targets by default. Future research should give us a better understanding of which additional factors to consider during in-depth review of replication candidates (e.g., Pittelkow et al., 2023).

General discussion
The overall aim of this exploratory study was to test the feasibility of implementing the four-step replication study selection procedure based on RV Cn proposed by Isager et al. (2021) in a large body of social fMRI research. The current exploratory report shows the importance of testing the feasibility of proposed selection strategies, as well as carefully examining possible measures, auxiliary assumptions, and boundary conditions. We show it is possible to calculate RV Cn for a large candidate set of studies identified based on bibliometric information. We were able to reliably code the total number of participants and retrieve citation count data for each study in order to calculate RV Cn (step 2 in Fig. 1). However, we were only able to code uncertainty coarsely with 'number of participants in study', omitting the number of trials per participant, which also determines the standard error of the estimate (Westfall, Kenny, & Judd, 2014).
Traditional citation count metrics were highly rank-order correlated, meaning it makes little difference which source S is used in the calculation of RV Cn. Field-normalized citation counts provide a somewhat different measure of citation impact, and lead to less overlap in the final rank-order than non-normalized citation scores. This is especially true in an interdisciplinary research topic such as social neuroscience, where publications appear across scientific fields, which leads to different articles being normalized against different citation cluster averages. Altmetric attention scores are weakly correlated with traditional citation impact, and represent a qualitatively different approach to measuring value. Whichever measure is preferred, both Altmetric scores and traditional citation counts could easily be extracted using free and open source applications (e.g., Chamberlain et al., 2020; Ram, 2017), whereas field-normalized citation counts and citation counts per year are not publicly available.
Finally, in-depth review of the highest ranking indicator estimates from step 2 appears to be an important method of quality control before a candidate is selected for replication. This review revealed important boundary conditions of using citation counts and the total number of participants as measures of value and uncertainty. The auxiliary hypotheses that we explored, namely that past citation counts predict future citation counts, that the source of the citation counts does not substantially affect citation rank-order, and that we can control for the age of the article, were all supported.
Overall, however, we do not think our implementation of RV Cn in the social fMRI literature was successful. Modifications to either the selection procedure or scope are needed for future application in this research area. While it was feasible to reliably code sample size and citation count for over one thousand studies, several challenges hindered efficient implementation. First, the topic boundaries of a research area like "social fMRI research" are fuzzy. Social neuroscience clearly does not include volcanology studies, but it is not trivial (and perhaps not even possible) to define the borders between social neuroscience and related neuroscientific disciplines. This made it very difficult to execute step 1 of the strategy, and in spite of our best efforts to develop reliable inclusion and exclusion procedures, in every review step we discovered a substantial number of studies that should not have been included given our exclusion criteria. Second, it is difficult to say whether "number of participants" is a meaningful indicator of general uncertainty in a candidate set that contains such a wide range of study designs. While it is possible to correct for study design in theory (Isager et al., 2021), this is not possible in practice for such a large set of studies with widely varying within-subject structures. This reduces the usefulness of step 2, which is the very core of the RV Cn strategy. Third, we believe that, in step 3, more expertise with the study topics under review may be required in order to provide adequate face validation of the candidates ranked highly in step 2. In this study, we wanted a large candidate set, as a primary aim was to test the feasibility of applying step 2 to a large set of studies. However, future researchers aiming to use these or similar steps in selecting a candidate for replication may already start with a narrower candidate set in step 1, based on their research interest and expertise.
Whether these challenges generalize to application of RV Cn in other disciplines is an open question which will need to be empirically examined. The use of RV Cn might be more straightforward in more homogenous literatures, especially if these mainly rely on between-participant designs. It may also be more feasible to adapt or modify RV Cn to account for variations in study design (Isager et al., 2021, supplementary material 2) in fields where such information is easier to curate from the articles.
The current report provides insights into how RV Cn can be applied in practice, and how its feasibility can be evaluated. However, it does not yet provide insight into the validity of RV Cn as a measure of replication value. Future research could attempt to provide criterion validation of RV Cn by investigating whether RV Cn is associated with other operational measures that are hypothesized to predict expected utility gain. For example, we would expect RV Cn to predict which studies are chosen for replication in practice under the assumption that both RV Cn and the selection criteria used by researchers who perform replication studies are caused by the expected utility of the replication effort (Isager et al., 2021). It might also be possible to validate RV Cn by examining the extent to which RV Cn predicts subjective estimates of the relative replication value of a set of studies. Future studies could also aim to increase the understanding of which factors researchers usually consider when selecting a study for replication. Recently, Pittelkow et al. (2023) identified a number of criteria such as interest, doubt, impact, methodology, and feasibility. How feasible it is to include such factors in a formal study selection strategy remains an open question.
We end this article with some recommendations for researchers looking to apply replication study selection strategies. For researchers specifically interested in using RV Cn to identify important-to-replicate fMRI studies in social neuroscience, our study provides some important insights. First, focusing on a relatively well-defined subject within the social neuroscience literature, rather than all studies in the discipline, seems wise. For a recent successful implementation, see Zaragoza-Jimenez et al. (2023). Although this will restrict how broadly one can search for replication candidates, it will likely make it much easier to curate a candidate set of studies that includes only studies relevant to one's interests and expertise. Second, since within-subject designs are very common in fMRI studies, the RV Cn uncertainty estimate should ideally be based on the standard deviation in this field. If one elects to use sample size, it should be corrected for the design used (Isager et al., 2021, supplement 2). Be aware, however, that by using the standard deviation to estimate uncertainty one is forced to identify the effect of interest for each study in the candidate set, which will add work to the procedure. Taken together, while RV Cn itself can reliably and efficiently be computed for hundreds of studies, the general selection procedure (Fig. 1) seems more suited to a smaller, more homogenous set of studies than what we aimed for in this study.
It may of course also be valuable to study whether other potential selection strategies would work better than RV Cn in social fMRI research. We encourage interested researchers to conduct additional feasibility studies for other proposed strategies.
Finally, some general recommendations can be given to facilitate more efficient replication research in any discipline. First, it is important to conduct feasibility studies of a range of study selection strategies in more disciplines. As our study demonstrates, it is not enough to show that a study selection strategy works in theory or in toy examples. If we want replication study selection to be more strategic and efficient, replicating researchers will need clear guidelines for how to implement and adapt strategies in practice. Feasibility studies are needed to develop such practical guidelines. Second, this work again highlights the need to standardize the reporting of study design and statistical uncertainty as much as possible. The task of evaluating the uncertainty in scientific claims becomes easier if researchers adhere to reporting standards, and when the relationship between statistical tests and scientific claims is more clearly specified in the article (Appelbaum et al., 2018; Lakens & DeBruine, 2021). Third, in any replication study, we recommend explicitly stating why the original study was selected for replication (e.g., Pittelkow et al., 2023). By exploring and documenting the wealth of information relevant to replication study selection, we can increase the ability of researchers to make well-informed decisions about which original research would be the most important to replicate.

Fig. 1 - General study selection procedure in which the RV Cn indicator is implemented.

Fig. 2 - Overview of candidate selection process and data points available for each respective analysis reported below.

Fig. 3 - VOSviewer co-occurrence map of substantive keywords retrieved from the title and abstract of articles in our dataset. Colors represent VOSviewer-defined clusters of closely related keywords. See van Eck and Waltman (2014) for further details on clustering in VOSviewer. Online interactive version of the figure: https://bit.ly/3yDPMup.

Fig. 4 - Density distribution of citation metrics up to 200 citations. A) The distribution of raw citation counts from Web of Science (black), Crossref (red), Scopus (blue) and CWTS (orange). B) The distribution of CWTS citation impact up to a score of 10, normalized by research field/cluster. C) The distribution of Altmetric attention scores up to 100.

Fig. 6 - A) Citation trajectories for all articles in the dataset. B) Log citation trajectories, with box plot summaries for each year. C) Citations obtained in 2020 predicted by the average yearly citation count from the article's publication year until 2019.

Fig. 7 - Matrix of bi-variate correlations between age and citation indices.

Fig. 8 - Variation in sample size between coders. Sample size is plotted on log scale. The original sample size coded is represented on the x-axis. Double-coded sample size values are represented on the y-axis. Blue circles represent values from the PhD-student coder. Brown triangles represent values from the undergraduate student coder.

Fig. 9 - Distribution of sample sizes in the dataset. For visualization purposes, the x-axis limit is set to n = 200.

Fig. 10 - Scatter plot visualizing the relationship between RV WoS and RV Scopus, RV tncs, RV scite, and RV Alt. The distribution of RV WoS is visualized as bars on the x-axis. The distributions of the other replication values are visualized as bars on the y-axis. Blue bars (and dots) represent the 10 highest scores on the y-axis. Red bars (and dots) represent the 10 highest RV WoS scores. Purple dots represent scores that are among the 10 highest scores on both estimators. Two of the ten studies with the highest RV WoS scores are not included in the scatter plot with the RV Alt scores, as RV Alt could not be computed due to missing Altmetric attention scores.

Fig. 11 - Matrix of bi-variate correlations between replication value indices computed based on different operationalizations of value through citations or Altmetrics.

Table 1 - Journals in which the articles in our initial candidate set were most frequently published.
Waltman, van Eck, van Leeuwen, Visser, & van Raan, 2011) CWTS database by author TvL), and scite™ (www.scite.ai; obtained 2021-08-23 by scite™ staff on request). WoS, Crossref, Scopus and scite™ citation counts are all unweighted raw counts of incoming citations of an article. CWTS citation counts consist only of incoming citations that are not self-citations. We also collected field- and age-normalized citation counts from the CWTS database. This normalization process corrects for differences between subfields in how often papers are cited on average, with the aim to treat publications from different fields equally (for details about the normalization procedure, and a discussion of the use of arithmetic averages in skewed distributions, see Waltman, van Eck, van Leeuwen, Visser, & van Raan, 2011). The score represents how many more times the article is cited relative to the average citation count of an article in its field from the same year. Thus, our data contained three different functions w of traditional citation impact (raw count, self-citations subtracted, and field/age-normalized). Publication year data Y was collected from the WoS database.