Professional standards in bibliometric research evaluation? A meta-evaluation of European assessment practice 2005–2019

Despite growing demand for practicable methods of research evaluation, the use of bibliometric indicators remains controversial. This paper examines performance assessment practice in Europe: first, identifying the most commonly used bibliometric methods and, second, identifying the actors who have defined widespread practices. The framework of this investigation is Abbott’s theory of professions, and I argue that indicator-based research assessment constitutes a potential jurisdiction for both individual experts and expert organizations. This investigation was conducted using a search methodology that yielded 138 evaluation studies from 21 EU countries, covering the period 2005 to 2019. Structured content analysis revealed the following findings: (1) Bibliometric research assessment is most frequently performed in the Nordic countries, the Netherlands, Italy, and the United Kingdom. (2) The Web of Science (WoS) is the dominant database used for public research assessment in Europe. (3) Expert organizations invest in the improvement of WoS citation data, and set technical standards with regard to data quality. (4) Citation impact is most frequently assessed with reference to international scientific fields. (5) The WoS classification of science fields retained its function as a de facto reference standard for research performance assessment. A detailed comparison of assessment practices between five dedicated organizations and other individual bibliometric experts suggests that corporate ownership and limited access to the most widely used citation databases have had a restraining effect on the development and diffusion of professional bibliometric methods during this period.


Introduction
Research organizations and research funding agencies have a growing demand for practicable methods of research evaluation, including metrics based on publication and citation data. Such bibliometric indicators remain controversial among the scientific communities affected by performance assessment [1][2][3][4][5]. In recent years, several studies have reviewed scientific developments in the area of evaluative citation analyses [6][7][8][9][10][11]. However, there is little overview regarding which bibliometric methods are actually applied by practitioners in research evaluation. This gap is addressed by the present paper. In the literature, the expression 'metaevaluation' is commonly used to denote systematic reviews of evaluation studies with regards to methodological quality and results [12][13][14]. Similarly, an 'evaluation synthesis' reviews the findings of an already existing set of evaluations [15]. In the present study, I analyse the methodologies of existing evaluation studies from a meta-perspective. However, rather than evaluating published studies according to predefined methodological standards of good practice, my purpose here is to investigate two main research questions. First, what were the prevailing methodological standards, referred to as professional de facto standards, in the field of research assessment practice during a certain period? Second, if certain de facto standards of bibliometric research assessment can be identified, which social actors have defined them?
The methodological focus of this study is on the measurement of citation impact. Other topics of bibliometric assessment, such as emerging research topics, research profiles, and international collaboration, are excluded. Research productivity in the sense of an input-output relationship is only assessed in a few instances, since adequate data on resource inputs (scientific staff and funding streams) are often not available [16]. Moreover, this investigation only includes 'real' evaluations, i.e. assessments conducted for purposes of decision-making in research policy or research management. I applied several complementary search strategies, and identified 138 individual studies published during the period 2005-2019, which evaluated either research organizations (RO) or research funding instruments (FI) from 21 European countries plus EU framework programs. This study gives an overview of professional practices during fifteen years of expansion of bibliometric research assessment. My initial assumption was that leading organizations within the expert field would be influential in defining professional de facto standards: first, because they have a high market share of assessment services and, second, because they serve as a legitimate role model that is imitated by other bibliometric experts. The findings support this assumption but also highlight the importance of data access and data distribution for the operative establishment of de facto standards. All studies are documented in the Annex (Supplementary Tables).
This paper is part of a project conducted with the aim of understanding the development of bibliometric assessment methods from the perspective of Abbott's sociological theory of professions [17,18]. I selected this theory to investigate how particular methodological choices become socially established as professionally legitimate means of handling certain evaluation problems. More specifically, this framework is used to address the issue of professional control in bibliometric assessment. Applying Abbott's terminology, the increasing demand for practicable and efficient assessment of academic performance constitutes a problem amenable to expert service. Research assessment is potentially within the jurisdiction of professional experts who can define the nature of assessment problems, and offer solutions that effectively address clients' needs. One recent paper presents an empirical investigation of whether the academic research area of 'evaluative citation analysis' has successfully defined scientific standards for bibliometric research evaluation during the period 1972-2016 [19]. Based on organizational network analysis and the theory of intellectual fields as reputational organizations, [19] concluded that the field of evaluative citation analysis has been characterized by low levels of reputational control, evidenced by high shares of outsider contributions and new actors entering the field throughout the examined period. They argued that this lack of reputational control within the academic research area is consistent with observed difficulties in establishing scientific authority for bibliometric assessment practice.
In this paper, I first present theoretical considerations concerning the application of professional sociology to the field of bibliometric research evaluation. Then I describe data and methods of meta-evaluation, including search strategies, selection criteria, and the content analysis of evaluation studies with regards to their methodological design and metrics. Finally, I present empirical findings and discuss results in light of the theoretical framework.

Theoretical considerations
Abbott's theory of professions is a sociological framework for analysing how professional expertise is socially constructed and institutionalized in modern societies [17,18] (Fig 1). It is applicable in the setting of a societal problem amenable to expert service, and where groups of professional actors claim relevant expertise for treatment of this problem. The theory distinguishes between cognitive claims of expert knowledge and social claims for jurisdiction with regard to problem diagnosis and treatment, which professionals must establish in various social arenas, including the legal system, the public, and the workplace. The concept of professional jurisdiction extends beyond a merely economic notion (i.e. a market for expert services) by including the potential development of expert control regarding appropriate problem definitions and treatments from a socio-historical perspective. This framework is suitable for cross-national comparisons since it makes no specific theoretical assumptions concerning the nation state's role in the eventual settlement of professional jurisdictions. I selected this conceptual framework for the present project because it is suitable for investigating emerging professions that lack recognized domains of expertise, and which may eventually be protected by state licences but are currently engaged in competition with other professional actors for the appropriation of relatively new jurisdictions or tasks.

Fig 1. [17,18]. Source: [19]. https://doi.org/10.1371/journal.pone.0231735.g001

PLOS ONE
In applying this theoretical framework to the realm of quantitative research assessment, I assume that the demand for professional services predominantly arises from two important groups of potential clients: research organizations and research funding agencies (Fig 2). These organizations require reliable information concerning the performance of their scientists, research groups, and funded projects for decision-making purposes [20,21], and for accountability and legitimacy [22,23]. I thus assume that demand largely stems from a meso-level of organizations within the public research system. Although private firms also use bibliometrics, information concerning research performance assessment in the private sector is not systematically accessible and is thus not part of the present study. Additionally, several European countries are experimenting with the introduction of performance metrics to evaluate institutional funding and research on a national scale [24][25][26][27][28][29]. Italy demonstrated the most extensive use of bibliometric performance measurement during the observation period. In 2006, the Italian National Agency for the Evaluation of Universities and Research Institutes (ANVUR) was created with the mandate to evaluate all public research, an exercise called "Valutazione della Qualità della Ricerca" (VQR) [24]. The present meta-evaluation included bibliometric reports from the first two rounds of the VQR, covering the periods 2004-2010 (VQR I) and 2011-2014 (VQR II) [30], as well as national evaluations with a disciplinary scope conducted by the Nordic Institute for Studies in Innovation, Research and Education (NIFU).
Since the 7th Framework Programme, and continuing under Horizon 2020, the European Commission has implemented the "Research infrastructure for research and innovation policy studies" (RISIS) with the aim to "build a distributed infrastructure on data relevant for research and innovation dynamics and policies", but this collaborative project does not include the development of alternative literature and citation databases [31].
According to the theoretical framework, bibliometric assessment is a service provided by professionals. In 90% of studies in this meta-evaluation, bibliometric analyses were conducted by external experts, who often worked on behalf of the organization to be evaluated. Another 9% of studies were performed by bibliometricians employed by large non-university research organizations and funding agencies, namely the Spanish Consejo Superior de Investigaciones Científicas (CSIC), the German Max Planck Society (MPG), and the Swedish Research Council (VR).
I distinguish between individual bibliometric experts and organizations that are dedicated to bibliometric assessment services. Individual bibliometric experts are typically academics employed at universities or non-university research organizations, who conduct bibliometric studies as part of their individual research activities. The label "dedicated organizations" includes the Dutch Centre for Science and Technology Studies CWTS; the Nordic Institute for Studies in Innovation, Research, and Education NIFU in Oslo; the consultancy branch of the Web of Science Group owned by Clarivate Analytics, abbreviated here as TR/ Clarivate; ANVUR, the Italian state agency that implements the VQR; and the expert group working at CSIC, a large Spanish non-university research organization. CWTS, NIFU, and TR/ Clarivate are also referred to in the text as "expert organizations" in the sense of [18], while ANVUR is a state agency, and the CSIC bibliometric group conducts studies only on CSIC and its branches and does not offer professional services to other clients.
According to Abbott's theory, professionals' work can generally be described as the application of abstract knowledge to complex individual cases. Abstract knowledge lends legitimacy to claims of jurisdiction, tying professional work to the general values of logical consistency, rationality, effectiveness, and progress. Such scientific legitimacy includes definition of the nature of problems, a rational means of diagnosing these problems, and delivery of effective treatment. Moreover, abstract knowledge enables the instruction and training of students entering the profession, and facilitates the generation of new mechanisms of diagnosis, inference, and treatment. Abstract knowledge is typically accumulated by an academic sector closely related to the profession.
In a recent study, [19] investigated the academic research area of evaluative citation analysis as the academic sector that is closely aligned with bibliometric evaluation practice. Abstract knowledge can also be stored in specialized artefacts, which Abbott refers to as expert commodities. With regards to evaluative bibliometrics, the most important artefacts for professional work are citation databases, such as Web of Science (WoS) or Scopus. Another recent study [32] showed how science policy in the Netherlands stimulated the formation of quantitative research assessment as a new professional jurisdiction since the late 1960s in the form of an expert organization: CWTS. Using Abbott's framework, this study argues that the professional work of CWTS is subordinate to the older jurisdiction of peer review and may develop into an advisory jurisdiction in the future.
The present study complements the two aforementioned studies and examines actual evaluation practice, as visible in mostly publicly available evaluation studies. The included studies use publication and citation data to evaluate the performance of research organizations or funding instruments in Europe, and are published either as study reports (grey literature) or as journal articles (see methods). The authors of these studies include bibliometric experts and dedicated organizations, while the objects of evaluation are research organizations and funding instruments. The presently utilized definition of professional practice excludes ad-hoc uses of bibliometric indicators that are less explicitly codified-for example, in cases where research organizations use performance metrics to evaluate staff performance, or where funding agencies use a journal impact factor or the h-index to make unpublished selection decisions among program applicants. This study is confined to Europe, i.e. the evaluation objects must be located in a European country. Thus, my analysis of widespread assessment practices further contributes to the knowledge of commonalities in a European research area, as promoted by the European Commission [33].

Data and methods
In this section, I first describe the selection criteria for inclusion of studies, as well as the search strategies used to identify such studies, and discuss generalizability in light of this sampling strategy. Second, I present the coding scheme and procedures applied to extract methodological information from each individual study. In line with the theoretical framework, practices of bibliometric research evaluation are examined for an extended historical time period. The study focuses on work products of professional actors in the form of published evaluation reports (grey literature) or journal articles. In most cases, these evaluation studies have been designed by bibliometric experts for their individual clients, i.e. decision makers from research organizations or funding agencies, and the professional diagnostics proposed by the bibliometricians have been accepted by the respective clients in a contract relationship. Therefore, assessment reports document an important segment of professional work. Other, less codified applications of bibliometric indicators were excluded, in particular performance assessment by private firms and research laboratories, software tools such as SciVal and InCites, and unpublished or ad-hoc use for decision-making by research organizations or funding agencies. While these uncodified applications also seem relevant for the question of de facto standards, there is currently no systematic information on their usage and diffusion. The selection criteria for professional evaluation studies were as follows:
1. Each evaluation must include a publication and citation analysis. The sample includes studies relying exclusively on bibliometric data, as well as multi-dimensional evaluations that combine bibliometric data with other information, such as peer evaluations, financial and staff data, or case studies [34,35]. In either scenario, my analysis focused only on bibliometric analyses.
2. The objects of evaluation are either research organizations or funding instruments. Research organizations are typically universities and/or their departments or faculties, or extra-university public research institutes. Funding instruments are typically active in supporting research projects or individual researchers at public research organizations, sometimes with the involvement of private firms, and sometimes supporting long-term investments, such as excellence schemes.
3. Evaluation objects (research organizations and funding instruments) must be located in Europe.
4. The evaluation had to be conducted with the stated purpose of informing decision making on behalf of the respective research organizations or funding instrument. Purely academic studies of bibliometric data were excluded from the study sample.
Relevant studies from this sample were identified by analysing keywords and journal titles. The most frequent keywords related to research evaluation were "performance", "universities", "departments", and "faculty". I identified ten relevant journals within this publication sample, including "Research Evaluation", "Research Policy", "Research in Higher Education", "Evaluation Review", "Zeitschrift für Evaluation", and "Higher Education". From an initial set of 898 publications that were identified as potentially relevant (315 for 2005-2014; 583 for 2015-2019), 33 evaluation studies were retained (24% of the study sample). This search strategy proved particularly valuable in that it retrieved a total of eleven assessments published in medical journals, including for example the British Journal of Neurosurgery or the European Journal of Cancer, and four other disciplinary and foreign language journals, that would have escaped a more conventional search strategy based on a pre-defined set of core journals in scientometrics, science policy, and research evaluation.
5. I included reports from the Italian national evaluation agency ANVUR which has a legal mandate to evaluate the quality of activities performed by all research organizations receiving public money, and by funding instruments focused on research and innovation [37]. Reports are included from the first and second evaluation rounds. VQR I covers the period 2004-2010, and included nine of the 14 disciplinary areas in the Italian system, while VQR II covers 2011-2014 with 11 disciplinary areas.
Since each disciplinary committee has the mandate to determine the appropriate evaluation criteria within its field(s) of research, the reports of the different sectors were treated as individual bibliometric exercises in this meta-evaluation, although I found that bibliometric methods in VQR II were more streamlined than in VQR I. Thus, the VQR assessments contributed 20 individual studies to my study sample (14% of study sample).
6. Finally, I searched the World Wide Web for evaluation reports by funding agencies. Some countries and agencies follow high standards of transparency with regard to public research evaluation, including the Swedish Research Council (VR), the Swedish Environmental Protection Agency (SEPA), the Danish Council for Strategic Research, and the British Wellcome Trust, among others. These web searches identified a total of 23 relevant and publicly available evaluation reports (17% of study sample).
The generalizability of the descriptive findings on professional practices hinges on the quality of the sampling process. Randomization was not possible due to the exploratory nature of the study. Relative to the size of their national research systems, some countries are strongly represented while others appear underrepresented. I am confident that this variation is to a large extent due to different levels of bibliometric activity, since some countries have a tradition of quantitative research assessment while others do not [26,27,32]. Relatedly, some professional actors have published a large share of the study sample while others have authored only one or two reports. Since my purpose is to describe the professional field, I have to deal with the fact that this field is dominated by certain actors holding a large "market share". In order to deal with this unequal distribution and to compare different segments of professional practice, I decided to analyse the bibliometric methods by the five largest "dedicated" organizations (CWTS, NIFU, TR/ Clarivate, VQR, CSIC) separately from the rest of the "other" bibliometric experts.
The final study sample includes 138 distinct bibliometric studies, of which 102 (74%) evaluate research organizations and 36 (26%) evaluate funding instruments. The Italian VQR was the largest evaluation exercise within the sample in terms of number of researchers and publications under assessment. Since discretion on impact metrics was given to disciplinary committees, I treated the VQR as two rounds of parallel assessments to examine the methods used in each case (n = 20, S1 Table). The largest share of studies was produced by the CWTS, a contract research institute at Leiden University specializing in bibliometric assessment services (n = 37, S2 Table). NIFU is a non-university institute conducting research on the Norwegian and neighbouring Nordic science and innovation systems (n = 12, S3 Table). The assessment service by former Thomson Reuters Evidence Ltd, today part of the Web of Science Group owned by Clarivate Analytics, and abbreviated here as TR/ Clarivate, is represented with studies on funding instruments in the UK and EU (n = 7, S4 Table). The Spanish CSIC is an extra-university research organization that has its own internal evaluation unit (n = 5, S5 Table). Since studies from the same organization used identical citation impact metrics (CWTS, NIFU, TR/ Clarivate, CSIC), or at least shared important characteristics (VQR), I separately analysed the respective subsets with regard to some dimensions. The remaining 57 studies are labelled as studies by "other bibliometric experts" (S6 and S7 Tables for evaluations of research organizations and funding instruments, respectively). In this way, I can also observe whether the methods used by the "other" bibliometric experts are similar to or different from those employed by prominent organizations, which is relevant for the question of whether the latter function as role models of methodological know-how.
While the meta-evaluation was not designed as a country comparison, some national differences in assessment methodology are readily apparent as a result of this clustering.
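The sample composition reported above can be cross-checked with a short tally. The following is a minimal sketch in Python: the subset sizes are those stated in the text (S1-S7 Tables), while the dictionary keys and the helper name `sample_shares` are illustrative assumptions and not part of the original study.

```python
# Study counts per actor group, as reported in the text (S1-S7 Tables).
# The group labels used as keys are illustrative.
STUDY_COUNTS = {
    "VQR": 20,
    "CWTS": 37,
    "NIFU": 12,
    "TR/Clarivate": 7,
    "CSIC": 5,
    "other bibliometric experts": 57,
}


def sample_shares(counts):
    """Return each group's share of the total sample, rounded to whole percent."""
    total = sum(counts.values())
    return {group: round(100 * n / total) for group, n in counts.items()}


total_studies = sum(STUDY_COUNTS.values())
# Studies by the five "dedicated" organizations = everything except "other".
dedicated_studies = total_studies - STUDY_COUNTS["other bibliometric experts"]
shares = sample_shares(STUDY_COUNTS)

print("total studies:", total_studies)
print("studies by dedicated organizations:", dedicated_studies)
for group, pct in shares.items():
    print(f"{group}: {pct}%")
```

The tally confirms that the five dedicated subsets and the 57 studies by other experts add up to the 138 studies of the full sample.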
Some uncertainty remains, since the coverage of individual nations does not only reflect the diffusion of bibliometric methods, but may also result from different national transparency policies. For example, evaluations of the German Max Planck Society are usually kept confidential. In contrast, publication is mandatory under transparency rules in Sweden, such that evaluation reports are generally available on the internet. In line with Abbott's theory, the meta-evaluation approach should be considered in tandem with case studies of national jurisdictions, such as [32], to elucidate how science and higher education policies interact with the application and acceptance of bibliometric research assessment. There is no prima facie reason to assume that bibliometric techniques systematically differ between confidential sources and published reports, except with regard to the reported level of aggregation. For example, published reports by the Italian VQR and CWTS do not contain individual data although the reports state that the same methods were also applied at the level of individuals (VQR) or teams (CWTS). On the other hand, systematic methodological variation between countries with more transparency in contrast to countries with more confidential policies cannot be excluded. Consequently, the results of this study refer only to the available studies and cannot simply be generalized to unpublished work.
Analyses are presented cross-sectionally over the whole study period 2005-2019 in the main text. Longitudinal analyses comparing three five-year periods (2005-2009; 2010-2014; 2015-2019) are documented in supplementary materials (S8-S13 Tables). Bibliometric research assessment expanded considerably during the period investigated, from 2005 to 2019, and was fiercely debated among academics and evaluating agencies in several European countries. Also since the mid-2000s, academic publications on citation indicators soared, as documented in [19]. A comparison across time periods mainly shows that bibliometric assessments spread to a larger number of countries within Europe, while an increasing influence of certain professional standards could not be confirmed. In other words, while the present study covers fifteen years of expansion of bibliometric research assessment in Europe, institutional obstacles to bibliometric professionalism should be analysed in more detail as a next step [38].
Before answering the question if there are certain de-facto standards in terms of indicators used, I need to ask to what extent there is agreement in evaluation objectives. To a large extent, variation in study objectives is limited by the study selection criteria. The choice of citation indicators usually implies some sort of performance comparison between different units under study. It is important to understand that similar methods of performance assessment can be and are employed for widely different policy and management purposes [39]. For example, [21] argued that bibliometric evaluation was used by Dutch research organizations with the aim to identify promising young researchers that had not yet been fully established in terms of disciplinary reputation and networks, but also to put underperforming research areas on display in order to justify change action by university management. Quite differently, the Italian assessment exercise VQR was designed to inform central decision-making concerning the redistribution of national block funding between universities and academic departments. The important policy objectives behind the choice of metrics are not necessarily well documented in the bibliometric reports, and may even fluctuate during the policy process, since informal goals may be as influential as stated formal goals. Therefore, the meta-evaluation is confined to analysing formally stated goals, the frame of reference, and the main dimensions of comparison in each study for judging similarity in evaluation objectives.
Methodologically, this meta-evaluation is based on structured content analysis. I analysed the bibliometric design of each individual study using a scheme of 37 coding questions (S1 Appendix) in the following ten topical areas: (1) meta-information regarding the individual study, (2) the professional framework, (3) the object of evaluation, (4) the citation databases, (5) quality enhancement of the bibliometric raw data, (6) sampling strategy and data collection, (7) research fields under evaluation, (8) definition of citation data, (9) citation impact indicators, and (10) utilized statistical methods. Most items involved a nominal level of measurement, i.e. non-ordered qualitative characteristics. Five items were formulated as open questions, enabling raters to record more detailed information. This coding scheme was developed via an iterative procedure beginning with a partial sample. To test interrater reliability, two raters applied the initial coding scheme to an initial sample of 20 different studies. When coding differences became apparent, they were discussed among the two raters and the items were improved to reduce their ambiguity.
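The text does not report which statistic, if any, was used to quantify interrater agreement. For two raters and nominal items, Cohen's kappa is a common choice, and the sketch below uses it purely as an assumption. The function name `cohens_kappa` and the example codes for a hypothetical "citation database" item are invented for illustration and do not come from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal codes to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items coded identically by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per code.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical code throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical codes for one nominal item across ten studies.
codes_a = ["WoS", "WoS", "Scopus", "WoS", "other",
           "WoS", "Scopus", "WoS", "WoS", "other"]
codes_b = ["WoS", "WoS", "Scopus", "WoS", "WoS",
           "WoS", "Scopus", "Scopus", "WoS", "other"]

kappa = cohens_kappa(codes_a, codes_b)
print(f"kappa = {kappa:.3f}")
```

Items with a low kappa would be natural candidates for the kind of discussion and rewording between raters that the text describes.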
The remainder of this section comments on methodological choices involved in the design of the coding scheme (S1 Appendix). The topical area (2) "professional framework" was derived from sociological theory, distinguishing bibliometric experts working as external contractors from those who are employed as staff of the organization to be evaluated. Most other coding items were defined following the methodological literature on citation analysis, particularly [6,8,21].
Regarding (3) "the object of evaluation", I distinguish research organizations and funding instruments, which pose different challenges for evaluation design. In general, publications are linked to authors and authors are linked to research organizations via their institutional affiliations. In this way, research organizations are treated as author aggregates, with large variations in scale. It is more difficult to attribute publications to individual funding instruments because authors, and author teams in particular, typically receive funding from diverse sources and there is a variable time lag between funding input and publication output. Funding acknowledgements have been used to evaluate funding instruments only in a few cases in the most recent years, and are therefore not included in this analysis.
(4) "Citation databases" includes the major multidisciplinary databases WoS and Scopus along with Google Scholar and other specialized disciplinary databases, such as PubMed for the medical sciences or MathSciNet for mathematics. These databases are commodities storing abstract knowledge according to [18].
(5) "Quality enhancement of bibliometric raw data" refers to the fact that quality of citation data as offered in conventional licences by WoS and Scopus is insufficient for most bibliometric purposes. It is at least necessary to disambiguate author names and institutional affiliations, but also journal names. Data quality also includes controls of the accordance between the actual research fields of an evaluation object and their operationalization for assessment purposes, which can be regarded as checks for external validity.
The topical areas (6) "sampling strategy and data collection", (7) "research fields under evaluation", (8) "definition of citation data" refer to methodical details of data collection, for example time period, treatment of self-citations, and citation windows of a given analysis.
(9) "Citation impact indicators" belongs to the core of the meta-evaluation. Since a large array of citation impact metrics has been proposed in the literature ([19] collected 169 indicator variants published 1972-2016), the question is how detailed the analysis should be in order to meaningfully describe convergence or divergence in terms of indicators. Based on widely used bibliometric reviews [6,8,40,41,42], I distinguish among six groups of metrics: (a) journal impact metrics, (b) field-normalized arithmetic mean, (c) other field-based percentiles, (d) h-index and h-type indices, (e) source-normalized metrics, and (f) indirect impact metrics, plus (g) other metrics as a residual category, and I record combinations of these. An open question was provided in order to include the formula and methodological details of respective indicators. These six categories describe divergent measurement concepts on an aggregated level, but it is possible to examine variation within categories through the open question. The use of normalization was encoded separately, including the underlying classification of scientific fields.
(10) "Utilized statistical methods" refers to information on the significance of performance differences across study units, but also includes an open item recording special features such as innovative tools for analysis or software.

Bibliometric research assessment is most frequently used in the Nordic countries, followed by Italy, the Netherlands, and the United Kingdom
I found instances of bibliometric evaluation in many European countries, but the most regular use of bibliometric assessments during the observation period was concentrated in a few countries. Overall, my sample includes studies from 21 countries plus Framework Programs of the European Union. Approximately 26% of the studies, including four cross-country evaluations, were performed by the four Nordic countries Sweden, Norway, Finland, and Denmark, followed by Italy, the Netherlands, and the United Kingdom (Table 1). The Netherlands and the Nordic countries are medium-size research systems that show strong performance in international comparison. Among the larger public research systems in Europe, Italy is the only country that performs national-scale bibliometric assessment. The UK Research Excellence Framework only uses bibliometric data to inform peer review [9,25], but the sample includes evaluations from several important British funding agencies, including the Medical Research Council and the Wellcome Trust. Germany has no national framework for research evaluation [43], and my search strategy did not yield any bibliometric assessments from France. A longitudinal analysis reveals that bibliometric assessment only recently spread to Eastern European countries, including Romania, Lithuania, Serbia, and Slovakia.

The Web of Science (WoS) is the dominant database for public research assessment in Europe
The bibliometric evaluation of public research in Europe during the observation period was largely based on the citation indices contained in the WoS. Of the total sample set, 87% of studies used WoS data. The use of Scopus still derives in large part from the Italian VQR. Some studies employed designated databases, such as PubMed or MathSciNet, but these alternative citation databases exist for only a few disciplines. In other studies, citation data were complemented by national databases, which are more comprehensive in terms of research products but do not contain original citation data [44]. For example, the Norwegian Current Research Information System (CRISTIN) includes a larger array of document types, such as books and book chapters [27], and the Italian VQR includes all types of research outputs, e.g. software, patents, maps, and artworks [37]. While 7% of studies relied on the search engine Google Scholar to cover a larger variety of sources, the TR/Clarivate Book Citation Index (BKCI) was not used at all within this sample [45]. One important issue in bibliometric assessment is the extent to which bibliometric databases actually cover the investigated research fields [3,21,46]. Expert organizations have applied different methods to address this question of external validity. VQR and NIFU used their respective national publication databases to determine the coverage in international citation databases (external coverage) [47]. Using another approach, CWTS analysed the database coverage of the references cited within the studied publication sample (internal coverage) [21]. Among the studies by other bibliometric experts, only 18% investigated database coverage.
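The distinction between external and internal coverage can be made concrete with a minimal sketch. The function names, identifiers, and toy data below are illustrative assumptions and are not taken from VQR, NIFU, or CWTS procedures. External coverage is computed here as the share of nationally registered publications that are indexed in the citation database, and internal coverage as the share of cited references in the publication sample that point to indexed items.

```python
def external_coverage(national_ids, indexed_ids):
    """Share of nationally registered publications found in the citation database."""
    national = set(national_ids)
    if not national:
        raise ValueError("national publication list is empty")
    return len(national & set(indexed_ids)) / len(national)


def internal_coverage(cited_refs_per_paper, indexed_ids):
    """Share of cited references in the sample that point to indexed items."""
    indexed = set(indexed_ids)
    refs = [ref for paper_refs in cited_refs_per_paper for ref in paper_refs]
    if not refs:
        raise ValueError("no cited references in the sample")
    return sum(ref in indexed for ref in refs) / len(refs)


# Toy data, invented for illustration only.
indexed = {"p1", "p2", "p3", "r1", "r2"}
national = ["p1", "p2", "p3", "p4"]        # p4: e.g. a book chapter that is not indexed
references = [["r1", "r2", "x1"], ["r1"]]  # x1: a cited source that is not indexed

print(external_coverage(national, indexed))    # 0.75
print(internal_coverage(references, indexed))  # 0.75
```

In this sketch both measures only require set membership tests, but they draw on different inputs: external coverage needs a complete national publication register, whereas internal coverage can be computed from the reference lists of the sampled publications alone.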

Expert organizations invest in the improvement of WoS citation data and set technical standards for data quality
Raw citation data, as provided by WoS or Scopus, require considerable processing before they are adequate for the assessment of authors and research organizations [48]. The main issues are the ambiguity of author names and institutional addresses, and the unambiguous assignment of authors to research institutions (Table 3). The correct disambiguation of institutional name variants requires detailed knowledge of national research systems. These and other technical problems can further lead to certain proportions of false citation linkages in the raw data [49]. Expert organizations, including CWTS, NIFU, the Italian Institute for System Analysis and Computer Science (IASI), the German Max Planck Society, and the German Competence Centre Bibliometrics, currently deal with this situation by buying raw data from database providers (Clarivate Analytics, formerly Thomson Reuters, sometimes complemented by Scopus Elsevier) and constructing in-house databases with improved data quality. Access to WoS citation data of improved quality was available to 48% of the total study sample, including studies conducted directly by TR/Clarivate, but only to 19% of studies by other bibliometric experts. Consequently, studies by the other bibliometric experts also frequently mention the effort required for disambiguation of institutional addresses (51%) and author names (47%). Not surprisingly, one of the main practical arguments for the h-index is its greater robustness with regard to incomplete publication and citation data. A related issue is the need to verify the completeness of publication records. If a completed analysis were later found to be missing individual highly cited papers, this could seriously jeopardize assessment credibility in the eyes of stakeholders. Italy and Norway impose a mandatory requirement for each scientist to register a certain number of publications (Italy) or all publications (Norway).
These national publication records provide the basis for author searches of citation databases for evaluation purposes. CWTS uses a different approach, collecting internal publication records from research organizations, and sometimes sending these records to the authors for personal verification and completion. Personal verification by authors was also used by 21% of the studies by other bibliometric experts.
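The robustness argument for the h-index mentioned above can be illustrated with a small sketch. The citation counts are invented, and the example shows only one scenario, namely a publication record in which two low-cited papers are missing. In that scenario the h-index is unchanged while the mean citation rate shifts noticeably.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def mean_citations(citations):
    """Arithmetic mean of citations per paper."""
    return sum(citations) / len(citations)


# Invented citation counts for one author (assumption for illustration).
complete = [25, 12, 9, 6, 4, 3, 1, 0]
incomplete = [25, 12, 9, 6, 4, 3]  # the two least-cited papers are missing

print(h_index(complete), h_index(incomplete))  # 4 4
print(mean_citations(complete))                # 7.5
print(round(mean_citations(incomplete), 2))    # 9.83
```

The sketch does not imply that the h-index is insensitive to every kind of incompleteness. A missing highly cited paper can still lower it, which is why the verification of publication records remains relevant for all indicator types.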

Citation impact is most frequently assessed with reference to international scientific fields
As stated in the method section, the formal evaluation objectives are broadly homogeneous in that they all involve a comparative assessment of research performance. This section further examines the frame of reference, i.e. how the relevant comparison for performance assessment was construed in each case. It is striking that the analysed publication samples vary in size by orders of magnitude. For research organizations, the modal size category is 1,000-10,000 publications. In contrast, for funding instruments, the modal category is 10,000-100,000 publications. There are a total of eight studies with sample sizes over 100,000 publications, five of which are in the area of medical sciences and three of which investigate multiple research organizations (Table 4).
To analyse the frame of reference in greater detail, research organizations were differentiated according to scale (number of institutes or universities) and scope (mono-disciplinary vs. multi-disciplinary), while funding instruments were distinguished according to type of unit funded (research projects, scientists, research organizations, or portfolio review) (rows in Table 5). Concerning performance measurement, I distinguished between "international field comparison", "national rankings", and "other" (columns in Table 5). International field comparison refers to the assessment of observed citation rates with reference to the expected citation rates for the same research field and time period (often also for the same document type) throughout the entire database [6,50]. This type of measurement was used in 64% of the study sample, including 55 studies by CWTS, NIFU, and TR/Clarivate (except B6). In contrast, with national rankings, the relative national position defines research performance. This type of measurement was used in 25% of studies, including all 20 studies by the VQR I-II [37], but also 14 studies by other bibliometric experts. While international field comparisons occurred across all categories of evaluation objects (rows in Table 5), national rankings were used only for the comparison of departments and institutes, often by h-index/h-type indices within single fields (rows 1.3; 1.4). The remaining category "other" contains mainly inter-group comparisons, including some quasi-experimental designs (funded vs. non-funded scientists) in studies on funding instruments. Dedicated organizations used international field comparisons more frequently (70%) than other bibliometric experts (56%). Notably, the Italian VQR diverges from the dominant approach in professional evaluation practice. The predominance of international field comparisons is shaped by three dedicated organizations: CWTS, NIFU, and TR/Clarivate.
A closer look at the subset of 32 studies by other bibliometric experts with international field comparisons reveals two important sources of influence on their choice of methods. First, at least ten studies explicitly use indicators by and refer to authors from CWTS in their method section (F1; F6; F7; F8; F20; F27; F29; G5; G8c; G12), documenting the status of CWTS as a leading expert organization. Second, in eight cases, expected citation rates were purchased directly from Thomson Reuters (F12), or were based on Essential Science Indicators (F3; G15) or InCites (F19; F20; F27; G18; G19), in addition to the seven bibliometric assessments commissioned directly from TR/Clarivate. Fewer studies used Scopus (F16; F18; F26; F28; F30) or Google Scholar (F18; G5; G14) as alternative or additional sources. These findings show that while there remains considerable variation in the details of the calculation of international field averages during the period observed, a leading expert organization and a single database provider are important sources of conceptual and methodological convergence.
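The logic of international field comparison described in this section, observed citation rates set against expected rates for the same field, year, and document type across the entire database, can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, function names, and toy data are hypothetical, and the sketch implements only one possible variant (the mean of per-publication ratios). As noted above, the studies in the sample vary considerably in the details of this calculation.

```python
from collections import defaultdict
from statistics import mean


def expected_rates(database):
    """Mean citations per (field, year, doc_type) reference set in the whole database."""
    groups = defaultdict(list)
    for rec in database:
        groups[(rec["field"], rec["year"], rec["doc_type"])].append(rec["citations"])
    return {key: mean(values) for key, values in groups.items()}


def mean_normalized_score(unit_pubs, expected):
    """Average of observed/expected ratios over the publications of one unit."""
    ratios = []
    for rec in unit_pubs:
        exp = expected[(rec["field"], rec["year"], rec["doc_type"])]
        if exp > 0:  # reference sets without any citations are skipped in this sketch
            ratios.append(rec["citations"] / exp)
    return mean(ratios)


# Toy database, invented for illustration: two fields with different citation densities.
database = [
    {"field": "chemistry", "year": 2015, "doc_type": "article", "citations": c}
    for c in (10, 20, 30)
] + [
    {"field": "mathematics", "year": 2015, "doc_type": "article", "citations": c}
    for c in (2, 4, 6)
]
unit = [database[2], database[5]]  # one chemistry paper (30) and one mathematics paper (6)

expected = expected_rates(database)
print(mean_normalized_score(unit, expected))  # 1.5
```

In the toy data the two papers of the unit have very different raw citation counts (30 and 6), yet each is cited 1.5 times as often as the average paper in its own reference set. A value of 1.0 would correspond to the international field average.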

The WoS classification of science fields functions as a de facto reference standard for research performance assessment
Dedicated organizations generally used field-normalized indicators for impact assessment, most frequently either field-normalized top-percentiles, field-normalized arithmetic mean, or both (Table 6), while a more diverse picture emerges for other bibliometric experts. Some individual experts adhered to the same professional framework as defined by CWTS, NIFU, and TR/Clarivate, based on international field comparison and field normalization (39%). Examples include bibliometricians at the Swedish Science Council, the German Max Planck Society, or studies by the Canadian expert organization Science-Metrix. But there are also experts more
In principle, field normalization is applicable to different types of citation metrics [6], including arithmetic mean, highly cited percentiles, h-type indices, indirect citation metrics, and also journal impact. To avoid issues regarding field normalization, exclusively source-normalized impact metrics have been construed as a methodological alternative [41]. However, only some of the possible combinations are actually found in my sample. Eight studies used JIF quartiles, a variant of field-normalized journal impact (included in row 2.2), but no study used a field-normalized h-index. No study calculated either indirect citation (prestige) indicators or source-normalized indicators for observed citations. Instead, citation databases have provided indirect and source-normalized journal impact metrics, including Eigenfactor metrics (TR/Clarivate) and SNIP (Scopus), in recent years. In general, the overview in Table 6 reveals that research assessment practice during the study period included few of the methodological inventions that have recently been proposed in the academic debate on impact metrics [8,19,51].
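The idea behind field-normalized top-percentile indicators can be illustrated with a short sketch. The function names, the tie-handling rule, and the toy reference sets are assumptions made for this example and do not reproduce the procedure of any organization in the sample. A publication counts as highly cited here if fewer than 10% of the publications in its field reference set have strictly more citations.

```python
def in_top_share(citations, reference_set, share=0.10):
    """True if fewer than `share` of the reference set has strictly more citations."""
    higher = sum(c > citations for c in reference_set)
    return higher / len(reference_set) < share


def pp_top(unit_pubs, reference_sets, share=0.10):
    """Proportion of a unit's publications that fall into the top share of their field."""
    flags = [
        in_top_share(pub["citations"], reference_sets[pub["field"]], share)
        for pub in unit_pubs
    ]
    return sum(flags) / len(flags)


# Invented field reference sets with different citation distributions.
reference_sets = {
    "physics": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "history": [0, 0, 0, 0, 1, 1, 1, 2, 2, 5],
}
unit = [
    {"field": "physics", "citations": 9},  # top 10% in physics
    {"field": "physics", "citations": 5},
    {"field": "history", "citations": 5},  # top 10% in history
    {"field": "history", "citations": 2},
]

print(pp_top(unit, reference_sets))  # 0.5
```

Because the threshold is determined separately within each reference set, five citations place a history paper among the highly cited publications of its field but not a physics paper. With a 10% share, a result above 0.10 indicates more highly cited publications than expected from the reference sets.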
Journal impact was used quite frequently (49%), despite repeated criticism that the substitution of a journal's impact for the actual number of citations lacks validity [52,53]. While 91% of studies report observed citation data, journal impact was nevertheless often used to substitute for missing data, either because publications are so recent that actual citations are not yet available (e.g. VQR), or because articles are published in journals that are not covered by the database (e.g. NIFU and VQR). Notably, journal impact metrics are easily accessible via Journal Citation Reports, which is especially relevant for bibliometricians lacking fully licensed access to citation databases. Sometimes journal impact was used very pragmatically, for example simply for distinguishing between two broad levels of journal quality.
Overall, 78% of all studies used field normalization, including the field-normalized arithmetic mean, other field-related percentiles, and field-normalized journal impact (Table 7). Among these 107 studies, 83% relied on the WoS classification of science fields (WoS subject categories, SCs), and an additional 5% used the related Essential Science Indicators classification by TR/Clarivate. The Scopus science classification was mainly used in the Italian VQR. Of all field normalizations, 17% were based on self-defined journal sets, sometimes combined with keywords. Both the alternative journal-based classifications proposed in the academic literature and the publication-based clusters by CWTS have been developed to overcome methodological problems associated with the WoS classification of science fields. CWTS has in fact replaced the WoS classification with a taxonomy of approximately 4000 clusters since about 2016 [54].
Yet alternative journal-based taxonomies have had little influence to date and have not been used at all by dedicated organizations, while the publication-based clusters are proprietary and thus not available for use by other professionals. It can be concluded that the WoS classification of science fields has thus far retained the status of a de facto reference standard.
There has been some debate in the literature regarding the adequacy of WoS SCs as the basis for field normalization. One point of concern is the lack of transparency regarding the methodology used to construct and update the categories [56,58]. One study investigated the adequacy of the WoS classification, and empirically demonstrated that very few journals in the WoS were miscategorized in terms of their direct citation relations [58]. These authors found that some journals in the WoS do not display strong citation relations with their present category, but these journals generally appeared to be located between fields rather than to have been suboptimally assigned. Overall, WoS performs significantly better than Scopus with regard to the adequacy of journal categorization, which is related to the fact that journals belong to fewer categories in WoS than in Scopus.
Notably, [58] do not address the more fundamental question of how journal citation clusters are distributed across WoS SCs. Random walk models demonstrate that the total inter-journal citation network is characterized by densely connected regions and areas of much lower citation traffic [59]. The WoS SCs vary substantially in size; thus, it seems likely that some categories contain several clusters while others may comprise only one or even no cluster at all. [60] demonstrated that the WoS category "Library and information sciences" includes two clearly distinguishable journal citation clusters, and that eight journals publishing research in "Science and technology studies" belong to eleven different WoS SCs. Concerns have been raised that the internal heterogeneity of SCs with regard to research topics and citation densities may pose serious problems for field normalization [61]. One possible way to address this problem is the use of more fine-grained publication-based clusters [54].

Discussion
In this paper, I analysed the methods used in 138 bibliometric evaluation studies from a meta-perspective informed by Abbott's theory of professions [17,18]. In contrast to conventional meta-evaluations that assess how well a set of evaluation studies adheres to predefined methodological standards, the purpose of my meta-evaluation was to investigate whether professional de facto standards could be observed in bibliometric research assessment. More precisely, this study posed two research questions. First, what were the prevailing methods of bibliometric performance assessment in European evaluation practice during the period 2005-2019? Second, if methodological de facto standards existed, which actors were in the position to define them? A detailed review of assessment methods revealed that bibliometric assessment was most frequently performed in the Nordic countries, Italy, the Netherlands, and the United Kingdom, and that WoS was the dominant database used for public research assessment across 21 European countries. Expert organizations that invest in improving WoS citation data were able to set technical standards with regard to data quality. Citation impact was most frequently assessed with reference to the WoS classification of science fields (SCs), which thus far has retained the function of a de facto reference standard.
My findings demonstrate two main choices regarding the design of bibliometric research assessments. First, there is the choice between international field comparison, national ranking, and other designs, such as inter-group comparisons, as the main standard of performance. Expert organizations such as CWTS, NIFU, and TR/Clarivate clearly define international field comparison as the predominant professional framework across all categories of evaluation objects, while the Italian VQR deviates from this professional standard by assessing Italian universities via national rankings on composite performance indicators that are unknown elsewhere. Italy does not follow a model of bibliometric professionalism but applies bibliometric assessment as an element of central state governance of universities [38].
The second and related choice concerns field-normalized citation impact versus h-index and h-type indices. I found that all dedicated organizations use field-normalized citation impact, while h-type indices are more prevalent among those bibliometric experts that do not have access to high-quality citation data, either because the authors come from disciplines applying bibliometric assessment, often medicine, or from countries located at the periphery of the European science system.
These findings support [32], which highlighted the key role of expert organizations in shaping both the academic and the professional field of bibliometric evaluation. The analysed study set clearly documented the prominent position of expert organizations in the field. The two most prominent organizations were CWTS and NIFU, both of which have regularly conducted bibliometric assessments for many years and have produced important shares of the data set. These expert organizations were able to define technical standards with regard to enhanced quality of publication and citation data. Following the example of CWTS, NIFU and other expert organizations (e.g. the Italian Institute for System Analysis and Computer Science, the German Max Planck Society, and the German Competence Centre Bibliometrics) have invested in establishing in-house databases to clean WoS raw data. Bibliometric experts lacking equivalent databases cannot attain the same level of data quality, at least not for large publication quantities.
Perhaps more important than identifying the leading roles of CWTS and NIFU as bibliometric expert organizations, my analysis unequivocally documents the predominance of TR/Clarivate in terms of defining methodological standards for the performance assessment of public research in Europe. During the observation period, the provider of WoS assumed the most important role in defining de facto standards for bibliometric assessment. All expert organizations, including CWTS and NIFU, based their citation analyses on data licensed by Clarivate Analytics/Thomson Reuters, as did most other bibliometric experts. However, Clarivate Analytics (via its licensing policy) regulates the extent to which different user groups can access citation data. Moreover, WoS subject categories retained their function as de facto reference standards for bibliometric assessment. In addition, selected impact indicators are effectively disseminated via the Journal Citation Reports and InCites. Although academic bibliometricians have endeavoured to develop alternative categorizations of scientific fields [56,62,63] or more complex impact indicators, such efforts have had little impact on professional practice so far because they are not distributed alongside citation data. I found a few examples of the use of alternative or supplementary sources, such as the specialized citation database MEDLINE/PubMed and the Norwegian documentation system CRISTIN. These examples further underline that bibliometric evaluation practice depends first and foremost on the data sources that are accessible for comparative analyses of research performance.
These findings suggest that the current ownership structure of the most widely used citation databases has had a restraining effect on the development and diffusion of professional bibliometric methods during the observation period. First, in the current situation, the development of new diagnostic techniques within the academic sector has remained largely disconnected from their application and diffusion in the professional field. A case in point is the development of alternatives to the WoS classification of science fields. A likely explanation for why alternative academic journal classifications have not spread further is that they can only be implemented with appropriate in-house databases. In contrast, WoS or Scopus science categories are distributed by database providers alongside WoS/Scopus raw citation data. The CWTS publication-based clusters, on the other hand, represent a professional solution, but because this solution is proprietary, it cannot be widely shared, discussed, and improved in the academic sector.
Second, I find that bibliometric methods have spread to a larger number of countries in Europe over the observation period. In principle, the scientific development in countries seeking to catch up scientifically as well as economically should not be assessed with second-rate methods. Lack of access to first-rate citation databases seems to have influenced the methodological choices of bibliometric assessments in countries such as Greece, Romania, Lithuania, and Slovakia. From these instances I infer that shared access to citation databases would be an important mechanism to better connect methodological developments from the academic sector with their application and diffusion in professional assessment practice.
The theoretical framework used in this study raises questions regarding professional control over the production of and access to high-quality citation data and analysis tools. Abbott asked how corporate control of expert commodities (in this setting, citation databases) will affect the future development of professional knowledge and practice [18]. It seems likely that an open-access regime of citation databases would support the development and broad diffusion of more sophisticated bibliometric techniques for research assessment. Therefore, it seems promising to explore more explicit connections between methodological debates in evaluative bibliometrics and the ongoing open access transformation of the scientific publication system [64][65][66]. Further research and policy discussions should focus on whether and how open access to citation data could be provided in Europe.