Are numerical scores important for grant assessment? A cross-sectional study

Background: In the evaluation of research proposals, reviewers are often required to provide their opinions using various forms of quantitative and qualitative criteria. In 2020, the European Commission removed the numerical scores from the individual evaluations of the Marie Skłodowska-Curie Actions (MSCA) Innovative Training Networks (ITN) funding scheme but retained them in the consensus report. This study aimed to assess whether the linguistic characteristics of reviewer comments differed after numerical scoring was removed, compared to comments from 2019, when numerical scoring was still present. Methods: This was an observational study. Data were collected from the MSCA ITN evaluation reports of the 2019 and 2020 calls, comprising individual and consensus comments and numerical scores on the quality of the research proposal for three evaluation criteria: Excellence, Impact, and Implementation. All comments were analyzed using the Linguistic Inquiry and Word Count (LIWC) program. Results: In both years, comments on proposals' strengths were written in a style reflecting objectivity, clout, and positive affect, while a cold and objective style dominated comments on weaknesses; this pattern remained stable across proposal statuses and research domains. Linguistic variables explained a very small proportion of the variance in the differences between 2019 and 2020 (McFadden R² = 0.03). Conclusions: Removing the numerical scores was not associated with differences in the linguistic characteristics of the reviewer comments. Future studies should adopt a qualitative approach to assess whether there are conceptual changes in the content of the comments.


Introduction
The process of evaluating research grant proposals has attracted considerable attention in the past decade. With the increasing amount of funding for research, there is a constant need to improve evaluation procedures so that funding goes to the most promising project proposals. Recent scoping reviews on peer review for research funding recommend, among other propositions, identifying interventions that consistently resolve peer review issues in proposal evaluation (Recio-Saucedo et al., 2022; Shepherd et al., 2018). Studies on grant peer review have mostly focused on the analysis of criteria used by expert reviewers when assessing proposals (Abdoul et al., 2012; van Arensbergen and van den Besselaar, 2012; Hug and Aeschbach, 2020). Other studies have investigated the linguistic content of review reports (van den Besselaar et al., 2018; Hren et al., 2022). However, to the best of our knowledge, evidence is missing on how the requirement to numerically score a grant proposal, i.e. to attribute a numerical/quantitative score to a proposal, affects the way a reviewer comments on or expresses opinions about the proposal.
The evaluation of grants submitted to EU research programmes, the so-called Framework Programmes for research and innovation, usually consists of two consecutive steps, with each proposal going through (1) an individual evaluation by (typically three) different expert reviewers and (2) a consensus phase, where those reviewers agree on the final evaluation of the proposal. In both steps, the evaluation normally focuses on three criteria: a) research Excellence, b) Impact, and c) Implementation, for which comments must be given separately. Each criterion is attributed a score that determines the total score of the proposal. The result of the consensus stage is an evaluation summary report (ESR), consisting of the consolidated, concerted opinions of the group of expert reviewers. Previous studies have established this approach as a stable procedure in the evaluation of research grant proposals (Pina et al., 2021; Buljan et al., 2021).
In the previous Framework Programme, Horizon 2020 (H2020), some of the grant schemes saw changes in their scoring process. This was the case for the Marie Skłodowska-Curie Actions (MSCA), the flagship funding programme dedicated to promoting researchers' mobility and career development at all stages of their careers. In the past, expert reviewers were asked to provide comments and numerical scores for each of the three evaluation criteria, both in their individual evaluations (the so-called Individual Evaluation Report, IER) and then at the level of the consensus, resulting in the final score of the evaluation summary report (ESR). For some of the MSCA funding schemes, this approach was discontinued, and numerical scores were no longer attributed at the level of individual evaluations (IER). Only textual comments were required at the IER stage, and numerical scores were used at the stage of the consensus for the ESR. The aim of the procedure was to simplify the process and to give reviewers a chance to focus on the text of the evaluation feedback, as suggested by a previous study (Herbert et al., 2015).
A recent study indicated that proposal weaknesses have a greater effect on the ranking than proposal strengths (Hren et al., 2022). Based on this finding, the ranking of proposals would greatly depend on the reviewers' ability to identify and describe weaknesses. When there is a large number of proposals, qualitative methods of analysis can be inefficient. Quantitative analyses of the text, i.e. tools that assess quantitative characteristics of the text, can therefore be a solution, as they have been shown to be relevant for proposal evaluation (Luo et al., 2022).
The objective of this study was to compare the linguistic characteristics of the comments related to the Excellence, Impact, and Implementation criteria in the evaluation reports of MSCA Innovative Training Networks (ITN) proposals submitted in 2019 and 2020, under H2020, in order to assess whether the removal of numerical scoring affected the structure of IER textual comments and whether this change was associated with the evaluation outcome at the consensus stage, i.e. the ESR. We chose the ITN granting scheme because, with around 1,500 annual submissions and a success rate below 10%, it is among the most oversubscribed and competitive schemes of the whole framework programme.

Ethics and consent
We worked on anonymized datasets, without insight into the actual content of the proposals or the names of the applicants or expert evaluators, so the regulations on personal data protection were not applicable.

Study design
This was a cross-sectional study conducted in 2022.

Participants/sources of data
The data analyzed consisted of the IERs and the ESRs of all ITN proposals evaluated in the 2019 and 2020 calls. Each report includes textual comments referring to the different evaluation criteria. IER scores were only available for 2019. The anonymized quantitative data, as well as the analyses used in this article, are available on the Open Science Framework: https://osf.io/6bpvu/?view_only=.

Assessment tool
Linguistic characteristics of experts' comments were assessed using the Linguistic Inquiry and Word Count software (Pennebaker et al., 2015a, 2015b), a program that counts words related to different psychological states and phenomena and gives a score that is the proportion of the specific category in the entire text.
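To make the mechanics concrete, here is a minimal sketch of dictionary-based scoring in the LIWC style. The category word lists and the function name are illustrative assumptions of ours, not the proprietary LIWC 2015 lexicon; the point is only that each category score is the category's share of all words in a comment.

```python
import re

# Hypothetical category lexicons for illustration only (not the LIWC dictionary).
CATEGORIES = {
    "positive_tone": {"excellent", "strong", "innovative", "convincing"},
    "negative_tone": {"weak", "unclear", "insufficient", "vague"},
}

def liwc_style_scores(text: str) -> dict:
    """Return each category's share of all words, as a percentage of total words."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # guard against empty comments
    return {cat: 100.0 * sum(w in lexicon for w in words) / total
            for cat, lexicon in CATEGORIES.items()}

# Example: this toy review sentence scores 10% positive and 10% negative tone.
print(liwc_style_scores("The proposal is innovative but the impact section is unclear."))
```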

Variables analyzed
We collected the data on the proposal status after evaluation ("Main list", "Reserve list" or "Rejected"), call in which they were submitted (2019 or 2020), research area, total evaluation scores, as well as numerical scores for Excellence, Implementation and Impact criteria, together with corresponding comments which separately described proposal strengths and weaknesses.We separately analysed IERs and ESRs.
For evaluation purposes, proposals are categorized into eight panels: Economics (ECO), Social Sciences (SOC), Mathematics (MAT), Physics (PHY), Chemistry (CHE), Engineering (ENG), Environmental Sciences (ENV), and Life Sciences (LIF). For this study, we clustered MAT, PHY, CHE, ENG, and ENV into a single research domain, PHYENG, and ECO and SOC into ECOSOC. The three research domains in this study were therefore PHYENG, ECOSOC, and LIF.
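Expressed as a lookup, the grouping is simply the following (a sketch that restates the mapping described above; the dictionary name is our own):

```python
# Panel-to-domain grouping as described in the text.
PANEL_TO_DOMAIN = {
    "ECO": "ECOSOC", "SOC": "ECOSOC",
    "MAT": "PHYENG", "PHY": "PHYENG", "CHE": "PHYENG",
    "ENG": "PHYENG", "ENV": "PHYENG",
    "LIF": "LIF",
}

assert PANEL_TO_DOMAIN["CHE"] == "PHYENG"
```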
LIWC variables were calculated separately for strengths and weaknesses for each of the criteria assessed. They included the word count and the text tone of the evaluation report (Kaatz et al., 2015; Kacewicz et al., 2014; Pennebaker et al., 2015).

Bias
To eliminate potential sampling bias, we collected data for the whole cohort of submitted MSCA ITN proposals in 2019 and 2020.

Statistical analysis
The analysis was done using the JASP statistical program (JASP Team, 2024).
Descriptive data were presented as frequencies and percentages for project status and panel. Text characteristics were presented as means and standard deviations, or as means and 95% confidence intervals in the figures. We first compared the differences in all variables using a t-test or a chi-squared test, depending on the nature of the variables. A P value below 0.001 was considered significant.
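As an illustration of this screening step, a minimal sketch follows. It assumes a pandas DataFrame with a year column (2019/2020), numeric LIWC columns, and categorical columns such as proposal status; all column and function names are hypothetical, and Welch's t-test is used here as one reasonable variant.

```python
import pandas as pd
from scipy import stats

ALPHA = 0.001  # significance threshold used in the study

def screen(df: pd.DataFrame, numeric_cols, categorical_cols):
    """Univariable screening: t-tests for numeric LIWC variables and
    chi-squared tests for categorical variables, comparing 2019 vs 2020."""
    pvals = {}
    g19 = df[df["year"] == 2019]
    g20 = df[df["year"] == 2020]
    for col in numeric_cols:
        pvals[col] = stats.ttest_ind(g19[col], g20[col], equal_var=False).pvalue
    for col in categorical_cols:
        table = pd.crosstab(df["year"], df[col])
        pvals[col] = stats.chi2_contingency(table)[1]  # second element is the p value
    # keep only variables that pass the threshold for the regression step
    return {k: p for k, p in pvals.items() if p < ALPHA}
```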

Deviations from the protocol
After performing the chi-squared and t-tests, variables that were not significant were excluded from further analysis. We used logistic regression to compare differences between the two call years, in which proposal variables (proposal status, word count for research excellence weaknesses, word count for implementation strengths, and negative affect levels for implementation strengths) were predictors and the year of the call was the criterion. The level of significance was set to 0.05.
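A minimal sketch of such a model is shown below, on synthetic stand-in data; the variable names are hypothetical placeholders for the predictors listed above, and statsmodels' reported pseudo-R² is McFadden's.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in data; the real values come from the IER/ESR dataset.
df = pd.DataFrame({
    "year": rng.choice([2019, 2020], n),
    "status": rng.choice(["rejected", "reserved", "main_listed"], n),
    "wc_excellence_weak": rng.normal(150, 40, n),
    "wc_implementation_str": rng.normal(120, 30, n),
    "negaff_implementation_str": rng.normal(1.0, 0.4, n),
})

# Dummy-code the categorical predictor and add an intercept.
X = sm.add_constant(pd.get_dummies(df.drop(columns="year"), drop_first=True).astype(float))
y = (df["year"] == 2020).astype(int)  # 2019 -> 0, 2020 -> 1, as in Table 3

model = sm.Logit(y, X).fit(disp=False)
print(model.params)
print("McFadden R2:", round(model.prsquared, 3))  # statsmodels' prsquared is McFadden's
```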

Results
Overall, review comments were written predominantly in an analytic and objective language, indicated by high levels of Analytical tone and low levels of Authenticity; this means that only a small proportion of reviewers formulated their arguments as personal opinions rather than objective comments (Figures 1 and 2).
Clout and emotional tone were more present in the descriptions of proposals' strengths (Figures 1 and 2). Comments describing the strengths of a proposal also contained more words related to positive emotional tone (Figures 1 and 2).
The acceptance of a proposal was predicted by the linguistic characteristics of the comments related to the weaknesses of the proposal, specifically a lower analytical tone across the weaknesses of all three criteria, a higher negative emotional tone for research excellence and impact weaknesses, and higher clout for research excellence (Table 2). In total, these predictors explained around 30% of the variance of the criterion (McFadden R² = 0.30). On the other hand, the differences between 2019 and 2020 were negligible, with predictors explaining around 3% of the variance (McFadden R² = 0.03) (Table 3). These predictors included the number of words in excellence strengths (both for individual reviewers and consensus reports), a lower analytical tone for excellence and impact in individual reports, and a higher emotional tone in consensus reports.
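For readers unfamiliar with this effect-size measure, McFadden's pseudo-R² compares the log-likelihood of the fitted model with that of an intercept-only (null) model:

```latex
R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}(M_{\text{full}})}{\ln \hat{L}(M_{\text{null}})}
```

A value near 0, such as the 0.03 above, means the predictors barely improve on the null model, whereas 0.30 indicates a substantially better fit.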
Regarding the differences in textual characteristics between consensus (ESR) and individual (IER) evaluations in 2019 and 2020, the greatest differences between ESR and IER scores were observed in emotional tone, a pattern that was stable across proposal statuses (Table 4) and research domains (Table 5), despite differences in the proportion of proposals across disciplines. Emotional tone was overall greater in the ESR results.

Discussion
In this study, which included all ITN proposals from the 2019 and 2020 calls, we aimed to assess whether the changes in the evaluation procedure were related to differences in the characteristics of review reports. We found that the differences in linguistic characteristics between reports from the two calls (2019 and 2020) were small and negligible from a practical point of view, indicating that the removal of numerical scores did not result in meaningful changes in the reports' comments, as assessed by quantitative text analysis. For both calls, the comments were written objectively, with weaknesses described with less emotion and more analytically than the proposals' strengths. On the other hand, we found that the final status of the proposals (i.e. main-listed or rejected) can be predicted by the linguistic characteristics of the reviewers' comments, especially the tone related to the identified weaknesses, indicating that weaknesses may be crucial in proposal evaluation.
The comments were written mostly in formal language, indicated by high levels of analytical tone for both strengths and weaknesses. The same feature was observed in a previous study of journal peer review reports (Buljan et al., 2021). Our results also provide evidence for the general advice to applicants to focus on the objective structure of their proposal (Baumert et al., 2022). Given the low levels of authenticity, when emphasizing proposal strengths the reviewers rarely used personal pronouns like "I" or "we", probably to present the strengths as factual information rather than personal opinion. This finding is contrary to the study of Thelwall et al. (2023), which pointed out that higher use of first-person pronouns in reviews is related to higher proposal quality. In the description of weaknesses, on the other hand, the reviewers more often presented the information as their personal opinion. This is further supported by the levels of clout. Clout, the tone which indicates writing from a position of power, was much higher in the descriptions of project strengths than in the descriptions of weaknesses, from which we can conclude that reviewers were more certain in their evaluation of strengths and wrote with less confidence when discussing the potential flaws of the proposals. The emotional tone was more positive in the descriptions of strengths, probably because of the use of words related to the project's probable success. In that respect, it is to be noted that the EC services instruct reviewers that evaluation reports should not express opinions, but rather evaluate factual elements of the proposals.
The principal difference between the ESR and the IER was the emotional tone score. Across different categories, the emotional tone of the consensus texts was higher than that of the IER texts, indicating a more positive tone in the ESR. This may be because only the ESR is sent out to applicants; the IER text is not externalized. In a previous study, we found that the agreement between reviewers was very high (Pina et al., 2021). At the time of the individual evaluation, the reviewers do not know whether the other reviewers will agree with them. It is possible that, when reviewers write a consensus evaluation, they are no longer constrained to strictly objective language, since it is established that the other reviewers agree with their opinion, so the tone is more relaxed and positive.
Our previous study of the predictive value of comments on proposals' strengths and weaknesses in the ITN evaluation process used both qualitative and quantitative (machine learning) approaches (Hren et al., 2022). Our present results partially confirm that study, which found that proposals' weaknesses are more predictive of the evaluation outcome (Hren et al., 2022). However, we found that only some elements of the weaknesses are predictive of the proposal status. Specifically, a higher analytical tone and fewer negative evaluation words in comments related to proposals' weaknesses were associated with a more favorable funding outcome. It should be noted that the themes that served as predictors in that regression model were identified qualitatively and were better predictors (explaining around 55% of the variance of the criterion) than our quantitative text analysis (around 30%). However, given the large number of proposals, linguistic characteristics of reviewers' comments may serve as an additional tool in proposal evaluation, as advised by others (Luo et al., 2022).
The finding that we did not observe meaningful differences in the tone of reviewers' comments needs to be interpreted in the light of several limitations. Our entire quantitative text analysis relied on dictionary-based algorithms, which may deviate slightly from manual analysis while still being predictive of proposal funding outcomes (Luo et al., 2022), a result partially reproduced in our study. Also, we used the most comprehensive LIWC categories, which are very broad. The LIWC dictionary contains many different categories, and it is possible that more specific categories would reveal meaningful differences. One aspect of quantitative text analysis, sentiment analysis or analysis of text tone, could serve as a useful tool to determine whether there were any differences in the evaluations performed after the removal of individual numerical scores, as reviewers were a common part of both procedures. We only focused on the linguistic characteristics of the comments related to the positive and negative sides of the proposals. The urgency resulting from COVID-19 could also have affected reviewer behavior, making reviewers less rigorous in the review process, but exploring that would require a more qualitative approach. Qualitative analysis of the proposals would give input on the potential differences between the two calls but, given the number of proposals, the practical value of such an approach is questionable. Furthermore, there was a big discrepancy in the number of proposals across disciplines, so the results can mostly be generalized to the physical and engineering sciences, and less to other disciplines. Finally, we do not have information about who the reviewers were, which may be relevant since individual characteristics, such as experience in research or reviewing, may influence the review process (Seeber et al., 2021). Based on our evaluation, we found no evidence that the removal of numerical scoring produced any differences in the evaluation output.
We recommend a follow-up of the evaluation process without numerical scoring, in order to assess whether this is a transitory phenomenon or a consistent finding over the years. We were able to test only a single year before and after the change, and future studies should cover longer time spans.

Conclusions
This study assessed whether the removal of numerical scores had a significant effect on the evaluation procedure.
The findings indicate that the removal of numerical scores did not contribute to meaningful differences in the evaluation procedure of H2020 ITN proposals or its outcomes. These results support the finding that the procedure used for the evaluation of MSCA grant proposals is very robust and stable. We recommend following up on the impact of this and future changes to the evaluation procedures in MSCA, in order to further improve its robustness and the fair and objective allocation of public funds to support research mobility in Europe and beyond.
This project contains the following extended data:
- Extended data.docx
Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

GENERAL COMMENT
This is an interesting and well-written article. I have a few specific comments and no substantial comments regarding the conclusions.
My main concern is the deviations from the preregistration, regarding both the measured variables and the statistical analyses. These deviations seem significant and increase the risk of bias. I want to stress that I believe the methods used are sound and that there are likely good reasons for the deviations. However, the reader should be made aware of them, and the authors should comment on: 1) which changes were made to the methods; 2) why they were made; 3) whether the results would have been different if the methods described in the preregistration had been followed. I would prefer if the Methods section contained a subsection called "Differences between preregistration and article" and the Results section contained comments regarding what the results would have been if the preplanned analyses had been done.
To help the reader of this review see the differences, here are the preplanned measured variables described in the preregistration (https://osf.io/t84ba): "The outcome variable will be the difference in final scores between 2019 and 2020. Input variables will be: year of the calls; evaluation panel; country of the coordinator (high and low research performing countries); linguistic characteristics of the individual expert evaluation in each of the three evaluation criteria (Excellence, Impact, Implementation) - LIWC analysis and another measure of sentiment analysis like RSentiment." There are no mentions of calculated indices or transformations. And here are the preplanned statistical analyses described in the preregistration (https://osf.io/t84ba):

"Statistical models
The categorical data will be presented as frequencies and percentages, while numerical data will be presented as means and standard deviations in the descriptive part and with 95% confidence intervals in the inferential part of the analysis. We will compare the differences in linguistic characteristics between 2019 and 2020 and simultaneously compare linguistic characteristics between different reviewer decisions using two-way ANOVA. The data in the models will be presented as group means and 95% confidence intervals while final results will be expressed with squared eta effect size.

Transformations
None planned.

Inference criteria
We will use the standard p<.05 criteria for determining if the ANOVA and the post hoc test suggest [...]"

Reviewer Expertise: Judgement and decision-making, peer-review, inter-rater agreement

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 02 Sep 2024

Ivan Buljan
Dear reviewer, thank you for your insightful suggestions to enhance our article. We have incorporated a new section titled "Deviations from the Protocol" and provided a detailed description of the revised methods in the latest version. The new version was submitted last week and will be available in the coming weeks. If you feel the changes still fall short, we are open to making further adjustments.

Kind regards, Ivan Buljan
Competing Interests: No competing interests were disclosed.

The study in hand evaluates whether reviewer comments on research proposals differed after the restructuring of the Marie Sklodowska-Curie Actions (MSCA) Innovative Training Networks (ITN) scoring scheme by eliminating the possibility of scoring using numbers. The authors utilize proposal data from 2019 and 2020 and found that linguistic characteristics did not differ after the removal of numerical scores from the specified scoring scheme. The following comments are applicable to the current draft of the manuscript:

○ It would be strongly advised that the authors run a grammar and typo check, or to obtain [...]

○ It would also be useful to further elaborate on the blinding model adopted by the MSCA and how it may have influenced the results.

○ In Table 3: How can the average word count be negative? Please provide an explanation.

○ It would be great to supplement the discussion section with a few recommendations for future researchers into the topic.

○ Have the authors considered adjusting for possible confounding factors, such as research domain, etc.? If not, please discuss the possible effects of such confounding factors in the Discussion section.

Overall, the authors follow a logical order in introducing the main aspects of the study, provide sufficient details in some areas and lack minute details in other areas. The findings are clearly conveyed along with a proper interpretation of the results. The value of such work would probably benefit research administrators along with policy makers.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Research Integrity and Peer review
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.
Additionally, please justify the specified P value threshold of 0.001 in your model, as it appears too stringent.

ANSWER:
The reason for the P<0.001 threshold was the large number of variables tested, in order to avoid alpha error inflation. However, we now also provide the entire analysis, where the reviewer can see that in the logistic regression all significant variables were significant at the P<0.001 level.
In the results section: The majority of proposals were from the physical science and engineering fields. Did this play a role in the outcome of the study, where no differences were detected before and after? The authors may want to expand on this in the discussion section.
Additionally, is it possible that the urgency resulting from COVID-19 affected the review of proposals in 2020?

ANSWER: Thank you for your suggestions. Both of them seem plausible, and we added them to the discussion section.
It would also be useful to further elaborate on the blinding model adopted by the MSCA and how it may have influenced the results.

ANSWER: Thank you for this comment. Individual assessments of grant proposals are done independently by each expert, and the experts are not blinded to the applicants' identity. They are blinded to the identity of the other experts, but they meet at the consensus stage. As this part of the evaluation did not change, we did not address it in the manuscript, as it would not have had an impact on the results.
In Table 3: How can the average word count be negative? Please provide an explanation.

ANSWER: We subtracted the ESR and IER results, and in the table we present only the differences. A negative value only indicates that the values in the IER were higher than in the ESR.
It would be great to supplement the discussion section with a few recommendations for future researchers into the topic.

ANSWER: Thanks, we added some recommendations in the Discussion section.
Have the authors considered adjusting for possible confounding factors, such as research domain, etc.? If not, please discuss the possible effects of such confounding factors in the Discussion section.

ANSWER: Thank you for your comment. We had not included this in the limitations paragraph of the Discussion section in the previous version, but it is included now. Thank you for your comments; we are ready to make any additional changes if needed.

Kind regards, Ivan Buljan
Competing Interests: No competing interests were disclosed.
(5) [...] explain why the variables were regressed on proposal status (Table 1). Please state which variable was predicted in Table 2 and why it was predicted (does "ITN calls" mean the years of the call?).
(6) Tables 3 and 4 are extremely difficult to process and understand. Therefore, I cannot follow the conclusions drawn in the last paragraph of page 6. Please describe more clearly how you arrived at your conclusions and/or present the data in a way that is easier to understand.
(7) "Discussion" and "Conclusions": Please discuss the implications of your findings for peer review practice. Please suggest what future research could investigate.
(8) Please upload the R and Jamovi code to OSF so that others can reproduce your analysis.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Partly

Are sufficient details of methods and analysis provided to allow replication by others? No
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: research evaluation, bibliometrics, peer review

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Reviewer comment:
The study examines whether and how the removal of rating scores in grant peer review changes the reviewers' texts. The analysis is based on data from a funding scheme where the rating scores were removed at the level of individual reviews (Marie Skłodowska-Curie Actions: Innovative Training Networks, ITN). Data were analyzed using the Linguistic Inquiry and Word Count software (LIWC). The results presented in Figures 1 and 2 show that the removal of rating scores had no effect on the four LIWC categories employed in the study, suggesting that written assessments are not affected by whether reviewers are mandated to assign scores to a proposal or not.
The study examines a relevant and original research question, uses a unique data set, and employs an analytical approach previously applied to peer review data (LIWC). However, the study needs to be improved in the following respects: (1) Please justify why the five panels MAT-PHY-CHE-ENG-ENV were clustered into one "research domain" and why the two panels ECO-SOC were grouped. The samples have very different sizes.
Reviewer comment: (2) In the section "Statistical analysis", variables are mentioned that have not been described in the section "Variables analyzed" (e.g., "word count for research excellence weaknesses"). I suggest that all variables used in the study be listed in a table, including the values the variables can take. If the variables are not self-explanatory, please also describe them.
Author response: Thank you for your comment; the table with explanations is now provided (Table 1: Description of LIWC variables used in the study). For the provided example, "word count for research excellence weaknesses", the variable is word count, but only for the subset of reports that discuss proposal weaknesses.

Reviewer comment:
(3) Many analytical categories are available in LIWC. The current study used four. Please explain why these four categories were chosen and why they are appropriate for addressing the research question.
Author response: We used the four most comprehensive categories, which have been applied in most studies. It is true that by using more specific categories we might find differences, but we would need to apply and compare a vast number of categories to do that. So we opted for a more pragmatic approach in our study. We now state this in the limitations: "…Also, we used the most comprehensive categories, which are very broad. LIWC dictionary contains many different categories, and it is possible that by using more specific categories, one may find meaningful differences…"

Reviewer comment: (4) In the section "Statistical analysis", it is stated that the variables [...]

The manuscript represents a high-quality text with a sound justification and is highly relevant in the discipline. The research focus is clearly established, and the procedures for selecting and analyzing data are explained in detail.
The language is formal and precise, and there are no signs of ambiguity that may lead to a lack of clarity.
The methodological design is simple and well-described, with explicit references to all the tools and variables selected for data analysis.The conclusions are relevant and proportional to the scope of the findings, which are discussed in relation to the corresponding literature.
Only one specific linguistic resource is mentioned: the use of first-person pronouns.The reader may benefit from more textual data to accompany the broad categories that the article presents.
Including specific examples from the data would have been helpful and increased the article's readability.
All in all, the manuscript shows a clear command of the topic, clarity of expression, and a sound and appropriate methodological design.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Figure 1. Linguistic characteristics scores of consensus report reviewers' comments about proposals in 2019 (blue) and 2020 (yellow).

Figure 2. Linguistic characteristics scores of individual evaluation report reviewers' comments about proposals in 2019 (blue) and 2020 (yellow).

Reviewer Report 14 March 2024
https://doi.org/10.5256/f1000research.153043.r246528
© 2024 Qussini S et al. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Seba Qussini, Medical Research Center, Hamad Medical Corporation, Doha, Qatar; Center for Biomedical Ethics and Law, KU Leuven, Leuven, Flanders, Belgium
Samer Hammoudeh, Medical Research Center, Hamad Medical Corporation, Doha, Qatar

Table 1. Description of LIWC variables used in the study.

Table 2. Ordinal logistic regression model for prediction of proposal status by linguistic characteristics of reviewers' comments a.
CI - Confidence interval; IER - Individual Evaluation Report; CR - Consensus Report; WC - Word Count. a The categories were ordered as follows: rejected, reserved, main-listed. A higher odds ratio indicates a greater probability of acceptance.

Table 3. Logistic regression predicting ITN call with individual (IER) and consensus (CR) comment characteristics a.
CI - Confidence interval; IER - Individual Evaluation Report; CR - Consensus Report; WC - Word Count. a The criterion variable was call year: the 2019 call was labeled as 0 and the 2020 call as 1.

Table 4. Average scores (mean, standard deviation) for differences between consensus score and individual evaluation scores across different outcome status categories and between 2019 and 2020 a.
a The difference was calculated for each project as the consensus (ESR) score result minus the individual evaluation (IER) score result.

Table 5. Average scores for differences between consensus score and individual evaluation scores across different research domains and between 2019 and 2020 a.
ECO/SOC - Economic and social sciences; LIF - Life sciences; PHY/ENG - Physics and engineering. a The difference was calculated for each project as the consensus (ESR) score result minus the individual evaluation (IER) score result.