A retrospective analysis of the peer review of more than 75,000 Marie Curie proposals between 2007 and 2018

Most funding agencies rely on peer review to evaluate grant applications and proposals, but research into the use of this process by funding agencies has been limited. Here we explore whether two changes to the organization of peer review for proposals submitted to various European Union funding actions had an influence on the outcome of the peer review process. Based on an analysis of more than 75,000 applications to three actions of the Marie Curie programme over a period of 12 years, we find that the changes – a reduction in the number of evaluation criteria used by reviewers and a move from in-person to virtual meetings – had little impact on the outcome of the peer review process. Our results indicate that other factors, such as the type of grant or area of research, have a larger impact on the outcome.


Introduction
Peer review is widely used by journals to evaluate research papers (Bornmann, 2011; Bornmann et al., 2010; Jackson et al., 2011; Baethge et al., 2013), and by funding agencies to evaluate grant applications (Cicchetti, 1991; Wessely, 1998; Reinhart, 2009; Guthrie et al., 2018). However, research into the use of peer review to assess grant applications has been hampered by the unavailability of data and the range of different approaches to peer review adopted by funding agencies. As such, the majority of studies have relied on relatively small samples (Cole et al., 1981; Fogelholm et al., 2012; Hodgson, 1997; Mayo et al., 2006; Pier et al., 2018), although some studies have been performed on larger samples (see, for example, Mutz et al., 2012). To date, most studies acknowledge the need for improvements (Demicheli et al., 2007; Gallo et al., 2014; Graves et al., 2011; Jirschitzka et al., 2017; Marsh et al., 2008; Sattler et al., 2015; Shepherd et al., 2018; Bendiscioli, 2019): in particular, it has been shown that the peer review of grant applications is subject to various forms of bias (Lee et al., 2013; Witteman et al., 2019).
The peer review process at many funding agencies concludes with an in-person meeting at which the reviewers discuss and compare the applications they have reviewed. However, some funding agencies are replacing these meetings with virtual ones to reduce both their cost and carbon footprint, and to ease the burdens placed on reviewers. A number of small-scale studies have shown that moving from in-person to virtual meetings had little impact on the outcome of the peer review process (Gallo et al., 2013; Carpenter et al., 2015; Obrecht et al., 2007), and a large-scale study by some of the present authors involving almost 25,000 grant applications to the European Union's Seventh Framework Programme (FP7) reported (along with other findings) that evaluating research proposals remotely would be, to a certain extent, feasible and reliable (Pina et al., 2015).
Here we explore if two changes in the way peer review was used to evaluate proposals to a number of European Union (EU) funding programmes had any impact on the outcome of the peer review process. The first change came in 2014, when FP7 gave way to the Horizon 2020 (H2020) programme: one consequence of this was that the number of evaluation criteria applied to assess applications was reduced from four or more to three: excellence, impact, and implementation. The second change was the replacement of in-person meetings by virtual meetings for a number of funding actions.
Ensuring that the evaluation process remained stable and reliable during these changes was a priority for the EU. To assess the impact of these two changes we analyzed almost 25,000 proposals to FP7 and more than 50,000 proposals to H2020 over a period of 12 years.

Results
The European Union has been funding researchers and projects under actions named after Marie Curie since 1996. Marie Curie Actions (MCA) were part of FP7, which ran from 2007 to 2013, and were renamed Marie Skłodowska-Curie Actions (MSCA) when H2020 started in 2014. MCA had a budget of €4.7 billion, which increased to over €6 billion under H2020. The MCA/MSCA evaluation process has been explained elsewhere (Pina et al., 2015) and consists of two steps. The first step, the individual evaluation, is done entirely remotely: each proposal is assessed by (typically) three reviewers, with each reviewer producing an Individual Evaluation Report (IER) and scoring each criterion on a scale from 0 (fail) to 5 (excellent), to one decimal place. During this step, the three reviewers are unaware of each other's identity. Once the IERs are completed, a consensus meeting is organized for each proposal, with the reviewers agreeing on a consolidated set of comments and scores that are summarized in a Consensus Report (CR). Although based on the initial IER scores, the final CR score is usually not an average of these scores. The CR is corrected for typos and other clerical errors to produce an Evaluation Summary Report (ESR): however, in practice, this has the same content and score as the CR, so we will refer to the CR score throughout this article. Ranked lists of proposals are established based on their CR scores, determining the priority order for funding, and the top-scored proposals are funded up to the available call budget.
Under H2020, all MSCA proposals are scored (for both IER and CR) on three evaluation criteria: excellence (which accounts for 50% of the score), impact (30%), and implementation (20%). Under FP7, the number of evaluation criteria varied with the type of action, as did the weighting attached to each: there were five criteria for Intra-European Fellowships (IEF) and four for Initial Training Networks (ITN) and Industry-Academia Pathways and Partnerships (IAPP). Under both FP7 and H2020 the IER and CR scores are a number between 0 and 100.
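As an illustration, the H2020 weighting can be expressed as a weighted sum of the three criterion scores. Below is a minimal sketch in Python, assuming the 0-5 criterion scores are combined with the 50/30/20 weights and then rescaled to the 0-100 range; the rescaling step (multiplying by 20) and the function name are our assumptions for illustration, not details stated by the programme documentation:

```python
# Hypothetical illustration of the H2020 MSCA weighting scheme.
# Criterion scores are on a 0-5 scale; weights are 50%, 30% and 20%.
# The rescaling of the weighted sum to 0-100 is an assumption.

WEIGHTS = {"excellence": 0.5, "impact": 0.3, "implementation": 0.2}

def weighted_score(excellence: float, impact: float, implementation: float) -> float:
    """Combine 0-5 criterion scores into a 0-100 overall score."""
    weighted = (WEIGHTS["excellence"] * excellence
                + WEIGHTS["impact"] * impact
                + WEIGHTS["implementation"] * implementation)
    return round(weighted * 20, 1)  # rescale 0-5 -> 0-100

print(weighted_score(4.5, 4.0, 3.5))  # 4.15 on the 0-5 scale -> 83.0
```

The example shows how the 50% weight on excellence dominates the overall score: a proposal rated excellent on excellence but only good on implementation still scores above 80.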
In our analysis, for each proposal we used the average deviation (AD) index as a measure of the (dis)agreement between the reviewers (Burke et al., 1999; Burke and Dunlap, 2002). This index is the sum of the absolute differences between each individual reviewer's score (IER) and the average (mean) score for a given proposal (AVIER), divided by the number of reviewers. For a proposal evaluated by three reviewers with scores IER1, IER2 and IER3, the AD index is (|IER1 − AVIER| + |IER2 − AVIER| + |IER3 − AVIER|)/3. The AD index does not require the specification of a null distribution and returns values in the units of the original scale (0-100 in our case), making its interpretation easier and more pragmatic (Smith-Crowe et al., 2013): the closer the AD index is to zero, the greater the agreement between reviewers. We also calculated the difference between the CR score and the AVIER (CR-AVIER) for each proposal.
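The AD index defined above can be computed directly from the individual scores. A minimal sketch in Python (the function name is ours):

```python
def ad_index(ier_scores):
    """Average deviation (AD) index: the mean absolute deviation of the
    individual reviewer scores (IER) from their mean (AVIER).
    Returns a value in the units of the original 0-100 scale; the closer
    to zero, the greater the agreement between reviewers."""
    avier = sum(ier_scores) / len(ier_scores)
    return sum(abs(s - avier) for s in ier_scores) / len(ier_scores)

# Three reviewers scoring a proposal 80, 85 and 90: AVIER = 85,
# absolute deviations 5, 0 and 5, so AD = 10/3.
print(round(ad_index([80, 85, 90]), 2))  # 3.33
```

Note that the function generalizes to any number of reviewers, matching the definition in the text (sum of absolute deviations divided by the number of reviewers).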
Categorical data are presented as aggregated sums, frequencies and percentages, and continuous data as means and standard deviations. The differences between agreement groups were tested with one-way ANOVA. We assessed the associations between the CR scores and the average of the IER scores, and between the CR scores and the AD indices, using Pearson's correlation coefficient. We used interrupted time series analysis to assess how changes in the organization of peer review influenced CR scores and AD indices over the years (Cochrane Effective Practice and Organisation of Care (EPOC), 2017). Table 1 shows the number of evaluated proposals to each of the three actions (IEF/IF, ITN, IAPP/RISE) between 2007 and 2018, along with the number of evaluation criteria and the format of the consensus meeting (i.e., on-site or remote). For each of the three actions, Figure 1 plots the number of proposals, the mean CR scores and the mean AD indices over the same period.

Figure 1. Number of proposals, mean CR scores and mean AD indices for the three Marie Curie actions (IEF/IF, ITN and IAPP/RISE) between 2007 and 2018, used to investigate if two changes to the way peer review is organized – a reduction in the number of evaluation criteria in 2014, and a move from in-person to remote consensus meetings in 2016 – influenced the outcome of the peer review process. In the white region (which corresponds to FP7) four or more criteria were used to evaluate proposals and consensus meetings were in-person. In the coloured region (which corresponds to H2020) three criteria were used to evaluate proposals: consensus meetings remained in-person in the blue region, but became remote/virtual in the yellow region. The Results section describes how the three Marie Curie actions (IEF/IF, ITN and IAPP/RISE) changed over this period. Data for certain calls were not considered.
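The interrupted time series analysis mentioned above can be illustrated with a segmented regression on yearly mean scores. Below is a minimal sketch with NumPy on synthetic, noise-free data; the variable names, the synthetic series, and the specific segmented-regression specification (intercept, pre-trend, level change, slope change) are our illustrative assumptions, not the study's actual data or model:

```python
import numpy as np

# Synthetic yearly mean CR scores, 2007-2018, with an intervention
# (e.g. the change of evaluation criteria) taking effect at 2014 (t = 7).
t = np.arange(12, dtype=float)            # years since 2007
post = (t >= 7).astype(float)             # indicator: after the change
t_since = np.where(t >= 7, t - 7, 0.0)    # years elapsed since the change
y = 70.0 + 0.5 * t + 3.0 * post           # level jump of 3, no slope change

# Segmented regression: y ~ intercept + trend + level change + slope change
X = np.column_stack([np.ones_like(t), t, post, t_since])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef recovers intercept ~70, pre-trend ~0.5, level change ~3, slope change ~0
```

A near-zero level-change and slope-change coefficient on the real data would correspond to the study's finding that the organizational changes had little impact on mean CR scores.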
The two changes to the organization of peer review made during this period appear to have had very little impact on the mean CR scores or the mean AD indices, with the changes due to the reduction in the number of evaluation criteria being more pronounced than those due to the move from in-person to remote or virtual meetings (Table 2). The changes observed were very small and non-systematic, suggesting that they are probably attributable to the large number of analyzed proposals and the relatively few observation points, rather than to some meaningful trend. Table 3 shows the number of evaluated proposals, the mean CR scores and the mean AD indices for each of the three actions over three time periods (2007-2013; 2014-2015; 2016-2018), broken down by scientific panel (see Methods). The two panels with the most proposals for Individual Fellowships during the whole 2007-2018 period were life sciences and economics and social sciences and humanities; for ITN the panels with the most applications were life sciences and engineering, and for IAPP/RISE there was a predominance of engineering proposals. The mean CR scores and AD indices remained stable over the period studied.
We also studied the difference between the CR scores and the average of the IER scores, and the distribution of this difference is plotted in Figure 2 (along with the distributions for the AD indices and CR scores) for the three actions over different time periods. The distribution for the difference in scores is bell-shaped, with a maximum at zero difference; moreover, we found that the absolute value of this difference was two points or less (on a scale of 0-100) for 37,900 proposals (50.1% of the total), and 10 points or less for 72,527 proposals (95.9%). We also found (using Pearson's correlation coefficients) that the CR scores and the average of the IER scores were highly correlated for all three types of grants (Table A1 in Supplementary file 2). Proposals with higher CR scores also tended to have lower AD indices for all three actions, with some panels (economics and social sciences and humanities) and some actions (IAPP/RISE) having higher mean AD indices than other panels and actions (Table 3). Overall, for all proposals included in the analysis, the mean value of the AD indices was 7.02 (SD = 4.56), with 59,500 proposals (78.7% of the total) having an AD index of 10 or less (on a scale of 0-100). This suggests a high level of agreement between the reviewers. To explore if there was a relationship between the level of agreement (or disagreement) among the reviewers and the CR scores, we divided the H2020 proposals into three groups and calculated the mean CR scores for each group. In the 'full agreement' group all three absolute differences between the IER scores of each pair of reviewers were 10 points or less. In the 'no agreement' group all the absolute differences were above 10 points. In the 'other' group at least one absolute difference was 10 points or less, and at least one was more than 10 points.
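The grouping rule described above (pairwise absolute differences between IER scores compared against a 10-point threshold) can be sketched as follows; the function name and group labels are taken from the text, but the implementation itself is our illustration:

```python
from itertools import combinations

def agreement_group(ier_scores, threshold=10.0):
    """Classify a proposal by pairwise agreement between its reviewers'
    IER scores (0-100 scale): 'full agreement' if every pairwise absolute
    difference is at most the threshold, 'no agreement' if every pairwise
    difference exceeds it, and 'other' otherwise."""
    diffs = [abs(a - b) for a, b in combinations(ier_scores, 2)]
    if all(d <= threshold for d in diffs):
        return "full agreement"
    if all(d > threshold for d in diffs):
        return "no agreement"
    return "other"

print(agreement_group([80, 85, 89]))  # full agreement (diffs 5, 9, 4)
print(agreement_group([60, 75, 90]))  # no agreement (diffs 15, 30, 15)
print(agreement_group([80, 88, 95]))  # other (diffs 8, 15, 7)
```

Using pairwise differences rather than deviations from the mean makes the classification stricter than the AD index: a single outlying reviewer is enough to move a proposal out of the 'full agreement' group.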
Table 3. Number of proposals, CR scores (mean and SD) and AD indices (mean and SD), broken down by scientific panel, for the three different Marie Curie actions for three time periods between 2007 and 2018.

Of the 50,727 proposals we studied, most (31,803; 62.7% of the total) were in the 'other' group, followed by the 'full agreement' group (12,840; 25.3%) and the 'no agreement' group (6,084; 12.0%). In all cases, the 'full agreement' group had the highest mean CR scores, followed by the 'other' group and the 'no agreement' group (Table 4). For the IF and ITN actions the 'full agreement' group was generally bigger than the 'no agreement' group by a factor of about two; for RISE the two groups tended to be comparable in size (with the exception of 2014, when the 'no agreement' group was much larger). We also looked at these three groups by scientific panel and found no deviations from the general trends observed at the action level (Table 5). Across all H2020 proposals, those in the 'full agreement' group had a mean CR score of 85.1 (SD = 10.8), whereas those in the 'no agreement' group had a mean CR score of 70.3 (SD = 13.3). We also identified 3,097 H2020 proposals for which the difference between the CR scores and the average of the IER scores was greater than 10 points (Table A2 in Supplementary file 2): in 38.9% of cases the difference was positive (meaning that the CR score was higher than the average of the IER scores), and in 61.1% of cases it was negative (meaning that the CR score was lower than the average of the IER scores). The mean CR score for this subsample (67.8, SD = 18.37) was lower than that for all (FP7 and H2020) proposals (79.5, SD = 12.4), and the mean AD index (12.86, SD = 6.33) was higher than that for H2020 proposals (7.38, SD = 4.74).
This result indicates that proposals with a greater discrepancy between the CR score and the average of the IER scores have higher AD indices, and are proposals for which consensus is more difficult to reach. Another clear finding of our study is that the more reviewers disagree about a proposal, the lower the proposal's final score. This trend was observed consistently over the years, for all types of actions, and in all scientific fields, confirming the observations from the FP7 dataset (Pina et al., 2015).

Discussion
Our analysis of over 75,000 proposals from both FP7 and H2020, covering the period from 2007 to 2018, suggests that the peer review process used to evaluate these proposals is resistant to organizational changes, such as a reduction in the number of evaluation criteria and a change in the format of the consensus meeting. In particular, our results suggest that face-to-face consensus meetings do not guarantee a better consensus, at least if one considers an outcome in which the opinions of all reviewers involved weigh equally in the final score (Gallo et al., 2013). Our results also suggest that the level of (dis)agreement among reviewers depends more on the type of action or the scientific panel than on the way peer review is organized. Our study has some limitations. As a retrospective analysis focusing only on reviewer scores, it cannot provide insight into the reasons why reviewers agree or disagree on a particular proposal. Reviewers may have diverse perceptions of their role during the evaluation process, and/or interpret the evaluation criteria differently (Abdoul et al., 2012). Disagreement could also arise from inherent characteristics of the proposals, with a recent study showing that interdisciplinary proposals tend to score lower (Bromham et al., 2016), or from reviewers taking a conservative approach to controversial proposals (Luukkonen, 2012). Also, our analysis does not explore if proposals with higher scores are more likely to be successful (in terms of future outputs) than proposals with lower scores. Indeed, the ability of peer review to predict future success, as measured by scientific productivity and impact, has been subject to contradictory findings (Bornmann and Daniel, 2005; Bornmann et al., 2008; Li and Agha, 2015; Fang et al., 2016; Lindner and Nakamura, 2015; van den Besselaar and Sandström, 2015).
Finally, projects funded by the various Marie Curie actions require researchers to be mobile, and this might limit the relevance of our findings to other grant evaluation systems, though we have tried to guard against this by analysing three different types of actions, each with different levels of complexity and different success rates.
The MSCA evaluation process is currently evolving towards a system in which reviewers write Individual Evaluation Reports that do not contain numerical scores. The IF action started operating this way in 2019, and ITN and RISE followed in 2020. Although it will no longer be possible to undertake the sort of retrospective analysis of IER and CR scores we have performed here, as IERs will no longer contain scores, it will still be possible to observe the overall distribution of CR scores over time and thus monitor the consistency of the process by comparing current and future evaluation exercises with previous ones. We also suggest performing such analyses for other EU research funding programmes, as happens at other major funding agencies (Pier et al., 2018; Kaplan et al., 2008; Fang et al., 2016; Lindner and Nakamura, 2015; Martin et al., 2010). This would improve our understanding of the use of peer review to evaluate grant applications and proposals (Azoulay, 2012). The COVID-19 pandemic means that the use of remote consensus meetings is likely to increase under Horizon Europe, the successor to H2020 (European Commission, 2020). Our study therefore gives us confidence that the outcomes of the grant peer review process will not be impacted by this change.

Table 4. CR scores (mean and SD), broken down by level of agreement between reviewers, for the three different H2020 Marie Skłodowska-Curie actions between 2014 and 2018.

Methods
The EU's Marie Curie funding programme

Three Marie Curie actions funded under FP7 (2007-2013) were continued, under new names (see Table 1), as part of Horizon 2020 (H2020; 2014-2020). The Intra-European Fellowships (IEF) action and the Individual Fellowships (IF) action funded individual postdoctoral fellowships for mobile researchers. The Initial Training Networks (ITN) action and the Innovative Training Networks (also ITN) action funded projects that trained mobile doctoral candidates. The Industry-Academia Pathways and Partnerships (IAPP) action and the Research and Innovation Staff Exchange (RISE) action funded projects that promoted the mobility of staff between organizations from both public and private sectors.
The scoring scale used in FP7 and H2020 was based on five ordinal qualitative descriptors (0 = fail, 1 = poor, 2 = fair, 3 = good, 4 = very good, and 5 = excellent), with reviewers scoring MCA/MSCA proposals to one decimal place. In that context, a difference of 0.5 points or less (i.e., 10 points or less when converted to the 0-100 scale used for the final IER and CR scores) can be considered reasonably good agreement. The evaluation criteria used under FP7 were: (i) scientific and technological quality; (ii) training (ITN and IEF) or transfer of knowledge (IAPP); (iii) implementation; (iv) impact; and (v) the fellow candidate's CV (IEF only). The evaluation criteria used under H2020 were: (i) excellence; (ii) impact; and (iii) implementation. MCA/MSCA proposals were evaluated within one of the following panels: chemistry (CHE), economic sciences (ECO), information science and engineering (ENG), environment and geosciences (ENV), life sciences (LIF), mathematics (MAT), physics (PHY), and social sciences and humanities (SOC). For most of the period analysed, proposals in economic sciences and in social sciences and humanities were evaluated by the same pool of reviewers, so we have treated ECO and SOC as a single panel for the purposes of this study.

Data and analyses
The dataset in Supplementary file 1 includes data on all the proposals (n = 75,624) analysed in this study, sorted by type of action, call year and scientific panel. For each proposal, the score from the Consensus Report (CR) and the respective scores given by reviewers in their Individual Evaluation Reports (IER) are reported. All analyses were performed with JASP statistical software v. 0.11.1.0 (JASP team, 2020), R v. 3.6.3 (R Development Core Team, 2020), and SPSS Statistics for Windows v. 19.0 (IBM Corp, 2010).

Disclaimer
All views expressed in this article are strictly those of the authors and may in no circumstances be regarded as an official position of the Research Executive Agency or the European Commission.