Bibliometrically Disciplined Peer Review: on Using Indicators in Research Evaluation

Research evaluation relies on both peer review and bibliometrics, and the debate about the right balance between them continues. Both approaches have supporters, and both are criticized. In this paper, we describe a case in which the use of bibliometrics in a panel-based evaluation of a midsized university was systematically tried out. The case suggests a useful way in which bibliometric indicators can be used to inform and improve peer review and panel-based evaluation. We call this ‘disciplined peer review’, and ‘disciplined’ is used here in a constructive sense: bibliometrically disciplined peer review is more likely to avoid the subjectivity that often influences the outcomes of peer and panel review-based evaluation.


Introduction
With the increased use of bibliometric indicators for research evaluation, the critique of their use has also become louder. Especially after the Leiden Manifesto (Hicks et al., 2015), the publication of the Metric Tide report (Wilsdon et al., 2015), and the DORA declaration (2012), the use of bibliometrics in research evaluation has been discussed extensively. The main points of the criticism are as follows.
I. Bibliometric indicators reflect only a part of research output - and in research fields like the qualitative social sciences, humanities, and engineering only a small part (Mongeon & Paul-Hus 2016).
II. Bibliometric indicators cover at best some dimensions of quality, but by far not all (Van den Besselaar & Sandström 2019).
III. Bibliometric indicators have perverse effects, as researchers and research organizations will try to game them, using strategies like salami-slicing papers and excessive self-citation, and even go as far as misconduct and manipulation (Biagioli & Lippman 2020).
IV. The bibliometric digital infrastructure transforms the research system by goal displacement and task reduction (de Rijcke et al. 2016; Krüger 2020).
One conclusion from the critique could be that bibliometric indicators have changed the nature of research and that assessing the value of scholarly work is no longer part of the academic debate but increasingly takes place at the managerial level (Biagioli & Lippman 2020), negatively influencing the entire scientific enterprise. However, the empirical basis for such claims is, in our view, thin. To give a few examples: criticism of bibliometric indicators often picks out one or two measures (especially the Journal Impact Factor and the H-index) and generalizes the (possibly correct) criticism of those to indicators in general (Biagioli & Lippman 2020; Gingras 2020). Also, not much evidence exists for goal displacement and the related perverse effects. The often-cited study on Australia (Butler 2003), which suggested that productivity indicators lead to higher productivity but at the same time to lower quality, has been shown to be wrong (Van den Besselaar et al. 2017). And finally, work on indicator-related misbehavior also seems to rest on rather thin evidence. For example, Oravec (2019) discusses the "emerging practices (of indicator manipulation) … and their linkages with the norms and processes that support academic celebrity and stardom as well as the character of academic systems" (Oravec 2019, p. 859). Oravec borrows the empirical evidence from a study by Van Bevern et al. (2016) on incorrectly merging publications in Google Scholar to increase one's H-index. But this study shows the opposite: this kind of manipulation can only be done when highly similar titles are 'merged', which affects the H-index only to a rather limited extent.
[…] the former. But among the 'amateur bibliometricians', there is much disagreement too. In a recent research project 1 , some 35 panel members were interviewed to find out which criteria were deployed.
Many reviewers indicated that they did use bibliometric data 2 , but complained that fellow panel members were using the wrong indicators (e.g., citation counts versus the impact factor). This type of bibliometric check can be expected to influence the assessment.
Summarizing, we can characterize the situation as follows:
- Peer review has big problems, as it suffers from bias, subjectivity and conservatism, and lacks predictive validity.
- Bibliometric indicators do not cover all research output and do not cover all quality dimensions.
- Bibliometric indicators are intensively used (and discussed) in peer review practice by panel members and peer reviewers - but probably not always in a valid way.
- Advanced indicators are available, but need to be brought into the process explicitly.
- For several quality dimensions, indicators are lacking and need to be (further) developed.
- There is general agreement that indicators should be included in the evaluation as an input, and that the panel members have the last word. This also reflects the position of the authors.
It seems as if the discussion on peer review and bibliometric indicators has not made much progress, probably because it has remained a debate on possible risks, without much empirical work on how the combination of the two strategies works out in practice. What 'bibliometrically informed panel/peer review' could and should look like remains understudied. There are hardly any case studies that show how indicators function in the practice of valuing and evaluating, and with which consequences. Without these, it is difficult to develop models of how indicators could be used to improve evaluation. That such studies are lacking is easily understood, as what happens inside panels is generally confidential and not accessible for research. This paper contributes to filling that gap by investigating how the use of bibliometric indicators works in practice. We therefore do not compare the panel scores with the bibliometric scores, which is often done ex post (e.g., Rinia et al. 1998; Oppenheim 1996; Oppenheim 1997; Harzing 2018). Instead, we describe the process of informed peer review. We analyze a case of panel review where the systematic inclusion of indicators was a core part of the process. The case is about a small to medium-sized Swedish university, where the entire research portfolio was evaluated by a committee of fourteen scholars who collectively covered all research fields present in the university. Our study is strongly helped by the fact that almost all information about the evaluation is publicly available, including a description of and reflection on the panel activities and processes by the chair of the panel. The draft of that five-page report was circulated among the panel members, who provided suggestions for revisions, and it can be considered a consensus view of the panel.
Finally, it helped greatly that the two authors of this article were involved in the evaluation process in different roles 3 , and therefore had access to the only thing that is not publicly available but is crucial for evaluating the effect of the bibliometric indicators: the initial scores and reports on the units. This enabled us not only to analyze the panelists' view of bibliometric indicators, and how these indicators were discussed and used in the panel deliberations, but also to assess the effect of the bibliometric indicators on the final scores, without disclosing the confidential initial reports and scores.

The case
The case is a small to medium-sized Swedish university (Örebro University), with about 12,000 students and 300-400 researchers. Research is performed in many subfields of medicine, natural sciences, psychology, law, economics, 'soft' social sciences and humanities, and parts of computer science & engineering. There are three faculties: Faculty of Economic, Natural and Technical Sciences; Faculty of Humanities and Social Sciences; and Faculty of Medicine and Health. Since its foundation in the mid-1970s, Örebro University has been characterized by profession-oriented education.
Research at the university had also been evaluated five years earlier (ÖRE2010). That evaluation was organized for 38 Units of Assessment (UoA). A few successful units were given strategic resources for the following five-year period. A policy was adopted to increase the scientific output of the university, as the aim was to change the regional hospital into an academic hospital with a medical school. In 2011, the university was granted the right to have a medical faculty and exams in medicine. That was an important step in the development of the university, which had started as a university college with courses mainly in social work, social science, and humanities. Over the years the college built up a capacity for research, and it was granted the status of university in 1999.
The university has a relatively small research budget, which of course needs to be taken into account when evaluating research performance. The budget in the period 2012-2014 was a bit above 1.2 billion SEK per year, of which about two thirds is for teaching and one third for research. The university was ranked among the 401-500 leading universities in the Times Higher Education World University Rankings in 2019. The university is also ranked 74th out of the 150 best young (under 50 years) universities in the world. However, it is not ranked in the Leiden Ranking.
In 2015, the board, as well as the outgoing vice-chancellor, wanted a follow-up evaluation of all research in the university. The structure of the evaluation was as follows. A panel of 14 members was formed, some 25% of them from abroad, covering all disciplines. The evaluation procedure was conceived as a meta-evaluation (ORU2015, pp. 19ff.) based on extensive information about the research units, but without interviews with the research units and without reading publications authored by members of the research units. The evaluation had to cover the following aspects of research performance:
A. The quality of research
B. The research environment and infrastructure
C. Scientific and social interaction
D. Future potential
The committee was asked to do the evaluation using several pieces of information: A self-evaluation report, written by the research units; a letter and a presentation of the deans of the faculties; a bibliometric report at the individual level and the unit level, as well as a summary of the bibliometric study.
Although the evaluation dimensions are similar, there are clear procedural differences from the way research evaluation is done in other countries. Without being exhaustive, a few things should be mentioned here. First of all, Sweden has no national system for research evaluation, unlike the Netherlands and the UK. In the latter countries, research evaluation is done at the discipline level, and not for the university as a whole. In the UK, it is done by national disciplinary committees, doing peer review of a selected set of researchers and core publications. In the Netherlands, all units in a field can be evaluated at the same moment by the same panel, but this is not necessarily the case. The Dutch evaluation is based on a self-evaluation report, as in our Swedish case, which often includes bibliometric indicators of the performance of the unit(s). Although units mention five core publications, reading and evaluating the publications is not part of the research assessment - similar to the Örebro approach. In contrast to the Örebro approach, the Dutch research assessment protocol includes a site visit where the committee talks with a variety of representatives of the unit under evaluation: PhD students, junior staff, senior staff, and the department and faculty management.
In our case the following information was provided:
(i) The self-evaluation was the main piece of information. It included a description of the research program and projects, grants and staff, and sometimes the results, as well as the organizational embedding of the research. These self-evaluations differed in quality and content, as the units could decide themselves about the content and the format. The quality and level of detail of the self-reporting seemed to reflect the quality of the research. The self-evaluation covered the following topics, albeit at different levels of detail: 1) a self-assessment on the four performance dimensions mentioned above; 2) an overview of the research projects and/or teams within the unit; 3) an overview of the staff, budgets and grants.
(ii) Apart from the self-evaluation reports, the university board had asked the second author of this paper to produce, at the unit level and at the individual level, a set of (Web of Science-based) indicators, including the numbers of (full and fractional) publications, the field normalized citation scores (with and without a time window), the share of top 10% most cited papers (PP10%; also top 1%, 5%, 25% and 50%), as well as indicators for the average number of co-authors and the average number of international co-authors. In order to avoid as much as possible discussions about whether WoS-based indicators can be used at all in research evaluation, the bibliometric report also used publication counts in terms of the 'Norwegian model' (Sivertsen 2018). This means that publications were not restricted to WoS-indexed journals, but included a much larger output based on the universities' publications repository DIVA. A valuing system similar to the Norwegian one was used, in which publication points are based on the quality class a publication belongs to. The innovation for this evaluation was that (1) […] Sandström 2014). This resulted in scores that can be seen as an alternative to the citation-based scores.
Not all research units had data available to do this, especially the research units in the university hospital - the main reason being that the hospital was not aligned with the university library at the time of the evaluation. Table 1 shows an example of the bibliometric results.
(iii) The bibliometric report, however, was very thick and therefore not very user friendly. One of the panel members, the first author of this paper, was asked by the university board to produce a summary of the bibliometric report - a summary consisting of an explanation of the meaning of the indicators, an explanation of how to interpret the data, and a short summary per research unit. Figure 1 shows an example of the text in the bibliometric summary. Annex 1 presents the details of how the scoring was done.
(iv) In the weeks before the evaluation, each of the units was studied more in depth by two panel members, who both prepared a draft evaluation text about the unit. That pre-evaluation described the strong and weak points and suggested a score.
Summarizing, the panel received for each unit (i) a self-evaluation report; (ii) a report with extensive bibliometric data; (iii) a short summary of the bibliometric data; (iv) some additional information like letters of the dean(s) and an overview of teaching; and (v) two pre-evaluations by panel members. In some cases, the information pointed in the same direction, but often the scores were rather different. In the latter cases, the panel went in depth into the arguments provided and discussed the differences extensively in order to arrive at a consolidated score. Overall, this resulted in consensus, as panelists were able and willing to change their views under the influence of the debate. In this consensus formation, the summary of the bibliometric report was heavily used, but it also had another advantage that we will discuss below.
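As an illustration of the unit-level indicators listed under (ii), the following minimal Python sketch computes full and fractional publication counts, a mean field-normalized citation score, and the share of papers among the top 10% most cited. It is not the procedure used in the bibliometric report: the `Paper` record, the field baselines (`field_mean`, `top10_threshold`) and the sample values are hypothetical, and real implementations treat citation windows, document types and ties at the percentile boundary in more detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Paper:
    """Hypothetical record for one publication of a unit (illustration only)."""
    n_authors: int          # total number of authors on the paper
    n_unit_authors: int     # authors who belong to the evaluated unit
    citations: int          # citations received in the chosen time window
    field_mean: float       # assumed world average citations (same field, year, type)
    top10_threshold: float  # assumed citations needed to be in the field's top 10%


def full_count(papers: List[Paper]) -> int:
    """Every paper counts as 1, regardless of the number of co-authors."""
    return len(papers)


def fractional_count(papers: List[Paper]) -> float:
    """Each paper counts for the share of its authors that belong to the unit."""
    return sum(p.n_unit_authors / p.n_authors for p in papers)


def normalized_citation_score(papers: List[Paper]) -> float:
    """Mean of citations divided by the field baseline; 1.0 equals world average."""
    return sum(p.citations / p.field_mean for p in papers) / len(papers)


def share_top10(papers: List[Paper]) -> float:
    """Share of the unit's papers at or above the assumed top-10% threshold."""
    return sum(p.citations >= p.top10_threshold for p in papers) / len(papers)


# Invented sample data, not taken from the evaluation.
papers = [
    Paper(n_authors=4, n_unit_authors=2, citations=12, field_mean=6.0, top10_threshold=15.0),
    Paper(n_authors=2, n_unit_authors=2, citations=3, field_mean=6.0, top10_threshold=15.0),
    Paper(n_authors=5, n_unit_authors=1, citations=40, field_mean=10.0, top10_threshold=25.0),
    Paper(n_authors=1, n_unit_authors=1, citations=0, field_mean=4.0, top10_threshold=9.0),
]

print(full_count(papers), fractional_count(papers))
print(normalized_citation_score(papers), share_top10(papers))
```

Under these assumptions, a normalized citation score above 1.0 and a top-10% share above 0.10 would both point to performance above the world average.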

Can WoS data be used in all fields?
The role and relevance of bibliometric information were discussed within the evaluation panel, using the common arguments (Brändström 2015). The main issue was whether bibliometric data cover research output in a reasonable way. Here too, the issue came up for the social sciences and humanities and for computer science. Incomplete coverage, however, is only a problem if the WoS coverage is not representative of the total output of the units. From this perspective, it was a real advantage that, apart from the WoS-based indicators, the bibliometric report also included publication data from the national repository, including a much wider set of publications in two 'quality classes' (based on an enlarged version of the Norwegian system). The WoS indicators were calculated against the world average and the DIVA indicators against the national field averages, based on a number of researchers (50-100 per field) at Swedish universities. Table 2 summarizes the differences and commonalities between the two bibliometric indicator sets.
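The idea of weighting publications by type and 'quality class' and relating the result to a national field average can be sketched as follows. The weights are illustrative values in the spirit of the Norwegian model; the weights of the enlarged version used in this evaluation are not given here, so `WEIGHTS`, the sample publications and the field average are all assumptions.

```python
from typing import List, Tuple

# Hypothetical points per (publication type, quality level); level 2 is the higher class.
WEIGHTS = {
    ("article", 1): 1.0, ("article", 2): 3.0,
    ("chapter", 1): 0.7, ("chapter", 2): 1.0,
    ("book", 1): 5.0, ("book", 2): 8.0,
}

# One publication: (type, quality level, share of its authors that belong to the unit)
Publication = Tuple[str, int, float]


def publication_points(pub_type: str, level: int, unit_share: float) -> float:
    """Points for one publication, fractionalized by the unit's author share."""
    return WEIGHTS[(pub_type, level)] * unit_share


def unit_points(pubs: List[Publication]) -> float:
    """Total publication points of a unit."""
    return sum(publication_points(t, lvl, share) for t, lvl, share in pubs)


def relative_score(pubs: List[Publication], n_researchers: int,
                   national_field_average: float) -> float:
    """Points per researcher relative to an assumed national field average."""
    return unit_points(pubs) / n_researchers / national_field_average


def share_in_level(pubs: List[Publication], level: int) -> float:
    """Share of the unit's publications that fall in the given quality level."""
    return sum(1 for _, lvl, _ in pubs if lvl == level) / len(pubs)


# Invented sample data, not taken from the evaluation.
pubs = [
    ("article", 2, 0.5),
    ("article", 1, 1.0),
    ("book", 1, 1.0),
    ("chapter", 1, 0.5),
]

print(unit_points(pubs))
print(relative_score(pubs, n_researchers=5, national_field_average=2.0))
print(share_in_level(pubs, level=1))
```

Reporting the share of output per quality level next to the relative score shows whether a good score rests mainly on publications in the lower class.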
What do we observe? First of all, the medical research units in the university hospital only use Web of Science (units 1-6) and have no DIVA score. The next three units are university-based, but have very low DIVA scores, suggesting that they hardly registered their publications in the DIVA database. The same holds for the third group of seven units (10-16) which are also in fields that are highly dominated by international journal publications. The next 13 units in three groups have (almost) similar scores in WoS and DIVA. So for these 29 units, WoS seems to provide the relevant indicators for assessing the contribution to international science.
Only the last two groups of nine units show a somewhat or much higher score in DIVA than in WoS, and we will inspect those more carefully.
- Media & communication (30), Sport science (31), and Informatics (32) show a slightly higher score in DIVA than in WoS.
- Criminology (33) and Culinary arts and meal science (34) score moderate in WoS and very good in DIVA - also in the highly ranked media (level 2). This suggests that the WoS score may underestimate the performance of these two units.
- In the Gender studies (36) unit, two thirds of all publications were authored by a single researcher, a fixed-term visiting professor. All others score rather low in terms of productivity. The publications have a low impact in WoS. The high score in DIVA is based on many papers in the higher classified media.
- The Law (37) unit is large, with 22 members, and has no international publications in WoS, which was unexpected as the theme of the unit is international law. The DIVA score (1.2) is good, but 85% of the papers are in the lower classified media.
- Cultural diversity (35) and Rhetoric (38) studies score very good and good in DIVA, but most papers (85% and 95% respectively) are in the lower ranked media. Rhetoric studies had no international WoS publications. But most importantly, both groups were too small to be meaningfully evaluated (two and three researchers respectively).
Summarizing, only two out of the 38 units (nr 33 and nr 34) would probably receive too low a score if one applied only WoS scores. One may add three other units (30, 31, 32) with similar scores in WoS and DIVA, but somewhat higher in DIVA. This does not mean that we consider the local publications or the non-journal publications meaningless. A more detailed look at those groups where DIVA and WoS strongly differ may show that some international publications are neglected in WoS (e.g., international books). For the rest, the local (language) publications may function as knowledge dissemination to stakeholders. 4 But for the moment, the comparison suggests that the WoS-based output counts are representative for the larger DIVA set if one measures contributions to international science. This was also the shared view of the panel (Brändström 2015).
4 However, this often does not go through publications but via other channels (De Jong et al. 2014).
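The comparison of the two indicator sets amounts to a simple classification per unit, sketched below. The tolerance of 0.2 and the sample scores are hypothetical; the text does not state a numerical cut-off for 'similar' or 'higher', and the unit names are placeholders rather than units from the evaluation.

```python
from typing import Dict, Optional, Tuple


def compare_sources(wos: float, diva: Optional[float], tolerance: float = 0.2) -> str:
    """Classify a unit by how its DIVA-based score relates to its WoS-based score."""
    if diva is None:
        # No DIVA score is available for the unit.
        return "WoS only"
    diff = diva - wos
    if diff > tolerance:
        return "DIVA higher"
    if diff < -tolerance:
        return "WoS higher"
    return "similar"


# Invented scores: unit -> (WoS score, DIVA score or None)
units: Dict[str, Tuple[float, Optional[float]]] = {
    "unit_a": (1.1, None),
    "unit_b": (1.0, 1.1),
    "unit_c": (0.8, 1.4),
    "unit_d": (1.5, 0.3),
}

classification = {name: compare_sources(wos, diva) for name, (wos, diva) in units.items()}

# Units flagged "DIVA higher" are the ones where WoS alone might underestimate performance.
needs_closer_look = [name for name, label in classification.items() if label == "DIVA higher"]
print(classification)
print(needs_closer_look)
```

Only the units flagged as "DIVA higher" would then need the closer inspection described above; for the others, the WoS-based score can serve as the reference.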

The role of bibliometrics in the evaluation process
The use of the bibliometric data in the assessment basically boiled down to a reality check of the pre-evaluations prepared by the panel members. In cases where the scores proposed by the panelists were substantially higher or lower than the bibliometric data suggested, a discussion emerged. Those who had proposed the scores were asked to explain the differences. Often this was easy, as panelists indicated that they had been somewhat too harsh or (mostly) too friendly, and they could also easily explain why. For example, units may have been very good in the past and may have built up a strong reputation that still influenced the evaluation. But as the evaluation covered recent research (the last five years) only, the reputation-based score was far too high for the more recent research and was then lowered. A second example is that panel members had been doing some bibliometrics themselves, in order to inform themselves about the work of the unit. This is, by the way, a very general phenomenon that we have also seen in other studies. Using bibliometrics is not something 'alien' to science; many researchers accept that bibliometric indicators say something about performance. However, this is also risky, as we observed panel members bringing in the wrong data, forgetting that field normalization should be done (which they generally cannot do themselves), and also forgetting to take the correct time frame for the publication counts. With the correct bibliometric indicators at hand, the discussions could generally be closed easily and in a consensual way.
In yet other cases, the WoS scores were regarded as too low and the good DIVA scores were taken into account to reach the final assessment score. In the previous section, we mentioned the cases in which this happened.

Convergence of scores?
The deliberations were used to confirm the scores units had received in the initial evaluation, or to change these scores. In which direction did the scores change after the deliberation? Did the panel score converge to the WoS-based score? The question of convergence is relevant for 30 units. 5 As Table 3 shows, in 27 of the units (90%), the final panel score was closer to the bibliometric score than the initial score given by the reviewers, whereas in the remaining three cases (10%) the final score differed more from the WoS score than the initial score did. Overall, there were good reasons for the divergence too. As emphasized in the previous section, in some cases the WoS indicators were felt to underestimate the quality of the units, which did have many papers in (the higher quality category of) the DIVA system. This shows that the panel assessment remains important. Table 3 also shows that in two thirds of the cases the final score was lower than the initial score and in a quarter of the cases higher, showing that the panel deliberations and the bibliometric information more often led to lower scores.
[Table 3, recoverable fragments: Equal 3; Lower 23. * The inspection of the bibliometric report suggested already that WoS may have been underestimating quality (see the previous section). ** No grade as unit was too small.]
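The convergence question can be made concrete with a small sketch that, per unit, compares the distance to the bibliometric score before and after the deliberations and records the direction of the change. The scores below are invented for illustration, since the initial scores are confidential; the 'unchanged' category is likewise an assumption, as the text only distinguishes closer from more different.

```python
from collections import Counter
from typing import Dict, Tuple


def convergence(initial: int, final: int, biblio: int) -> str:
    """Did the final panel score move towards or away from the bibliometric score?"""
    before = abs(initial - biblio)
    after = abs(final - biblio)
    if after < before:
        return "closer"
    if after > before:
        return "further"
    return "unchanged"


def direction(initial: int, final: int) -> str:
    """Was the final panel score lower than, higher than, or equal to the initial one?"""
    if final < initial:
        return "lower"
    if final > initial:
        return "higher"
    return "equal"


# Invented scores: unit -> (initial panel score, final panel score, bibliometric score)
scores: Dict[str, Tuple[int, int, int]] = {
    "unit_a": (4, 3, 2),
    "unit_b": (3, 4, 2),
    "unit_c": (2, 3, 3),
    "unit_d": (3, 3, 3),
}

convergence_counts = Counter(convergence(i, f, b) for i, f, b in scores.values())
direction_counts = Counter(direction(i, f) for i, f, _ in scores.values())

# Share of units whose final score ended up closer to the bibliometric score.
share_closer = convergence_counts["closer"] / len(scores)
print(convergence_counts, direction_counts, share_closer)
```

Applied to the counts reported in the text, the same share would be 27/30 = 0.90.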
Despite the convergence, the panel scores remained on average somewhat higher than the bibliometric scores, as Table 4 indicates. The panel hardly gave the score 1 (weak): only five times, whereas the bibliometric score 1 was given 16 times. And the DIVA score 1 even occurred 21 times -so the average level according to the DIVA data is lower than according to the WoS data. Despite the fact that overall the panel adapted its initial score to the bibliometric score, it remained more moderate in its assessment.

Conclusions
Overall, panel members considered the inclusion 6 of the bibliometric data and the summary of the report useful. 7 The summary was used intensively to reach the panel's final consensus scores. As shown above, in many cases the WoS scores were sufficient bibliometric input to reach a final judgement - also in quite a few fields within the social sciences and humanities. However, in several cases, the DIVA information functioned as a useful supplement. The bibliometric data helped to correct subjective views during the panel discussions, where all available information was used - which is exactly what should take place in a panel. As a consequence, group dynamics played a less important role. Despite the common criticism of bibliometrics, which also existed in the panel, the overall appreciation of the bibliometric indicators was rather positive (ORU2015, pp. 19ff.).
Some may see the influence of the indicators on the outcome as a reduction of the freedom of peers to give their assessment of the different research units, and as an example of how bibliometrics is "exercising power" and "forcing science" in specific directions, for example towards topics that are preferred by international journals. From another perspective, this "reduced freedom" could be seen as a positive feature of the evaluation system. Available information should limit the range of appreciation of the performance. That is why the title 'bibliometrically disciplined peer review' was chosen for the current paper. In order to bring peer review to a level of disinterestedness and fairness (Merton 1973), and to avoid many of the problems of subjectivity and bias that research on peer review has reported, the challenge for the bibliometric community is to produce a larger set of valid indicators covering more of the quality dimensions that are important when evaluating research, including quality indicators for applied research and societal impact. The current dominance of impact and productivity indicators is too narrow.
Some limitations need to be mentioned here. The participatory approach has resulted in detailed insights into an assessment process, but at the same time, in light of its confidential nature, it affects and limits to some extent what can be communicated in the current paper. For example, panel deliberations remain a social process in which opinions, interests and practical issues like time pressure may play some role (Van Arensbergen et al. 2014). This remains confidential, as do some data like the pre-evaluations. But the bibliometric study, the final assessment reports of all the units, and the full evaluation report are openly available. Together with our participation in and observation of the process, this provides a rich and reliable picture. Furthermore, this paper comes with the limitations of a single case study. It would therefore be useful to have some more of these experiments to get a broader insight into how bibliometric data can be used to improve research assessment.
6 The DIVA scores were on a 4-point scale, with an average of 1.78. After rescaling to a 5-point scale, the average becomes 2.01.
7 Expressed verbally and in emails to the panel members. This is also reflected in the panel chair's report (Brändström 2015, pp. 19ff.).
Finally, the case presented in this paper is about evaluating universities