Towards a Consolidation of Worldwide Journal Rankings — A Classification Using Random Forests and Aggregate Rating via Data Envelopment Analysis

The question of how to assess research outputs published in journals is now a global concern for academics. Numerous journal ratings and rankings exist, some featuring perceptual and peer-review-based journal ranks, some focusing on objective information related to citations, some using a combination of the two. This research consolidates existing journal rankings into an up-to-date and comprehensive list. Existing approaches to determining journal rankings are significantly advanced with the application of a new classification approach, ‘random forests’, and data envelopment analysis. As a result, a fresh look at a publication’s place in the global research community is offered. While our approach is applicable to all management and business journals, we specifically exemplify the relative position of ‘operations research, management science, production and operations management’ journals within the broader management field, as well as within their own subject domain.


Introduction and objectives
The ranking of academic journals is a highly contentious element of research assessment, and thus a widely debated foundation stone for the ranking of individual research outputs and university rankings [1,2]. As it affects people's careers and aspirations, the issue is one of perennial topicality and debate. Findings are repeatedly challenged as lists arguably bear unintended consequences, skew scholarship and foster academic monoculturalism [3], and the methodologies underpinning the various approaches are contested as they are open to unintended use [4,5]. Within business and management, in recent years we have witnessed an increasing proliferation of rankings, listings and productivity indicators, drawing the attention of a wide range of academic disciplines, including accounting, economics, finance, international business and marketing [6], of associations such as the Association of Business Schools (ABS) and the Association to Advance Collegiate Schools of Business (AACSB), among others, but also that of dominant industry players such as Thomson Reuters' Web of Science, Elsevier's Scopus, and Google Scholar. These various parties are distinguished by unique interests. The commercial providers have started to monetize a rapidly expanding and lucrative global intelligence information business by building on the academic 'gift economy' [7], collecting institutional profile information and then selling it back to the institutions for strategic-planning purposes [8]. However, the aim of this paper is not to go into aspects of 'use and abuse' or epistemological positions regarding journal rankings [2,4]. Instead, given their broad adoption in today's academic practice, we address some distinct methodological shortcomings of previous attempts to rank journals and contribute to the development of a more suitable methodology, which, in turn, can be used to gauge the relative standing of individual journals more realistically.
There are three conventional ways of assessing journal quality: (i) subjective (perceptual), (ii) objective (citation-based) and (iii) a combination thereof (hybrid). All three feature well-known methodological limitations [9][10][11]. Recently, a fourth approach has gained momentum: the 'meta'-ranking approach, which, like the hybrid approach, is intended to provide a balanced view by delivering a composite journal ranking [cf. 12,13]. In contrast to the hybrid studies, which usually combine a few rankings or ratings and often involve the hand-collection of perceptual data, meta-analyses typically rely on a comprehensive selection of existing, in many cases reputable, rankings or ratings, and aim to deliver a reproducible outcome (cf. Table 1). As outlined, the existence of journal rankings is often, justifiably, contested on philosophical grounds, and there is the fundamental question whether possible distortions in terms of scholarship and unintended consequences of ranking exercises [see e.g. 2] may offset the advantages of increased manageability of scholarly outputs. Indeed, the emergence of meta-rankings can be seen as a result of the sheer volume and range of diverse lists that are proving to be unmanageable outside their respective academic institutions and that often include different selections of journals, counter to the original motivation for developing them, which was to improve academic resource 'management'. Within the academic community there seems to be agreement that if rankings are being used, the agenda should be the pursuit of a rigorous and objective perspective, based on state-of-the-art methodologies, free of individual stakeholder interests in this contentious area.
However, despite the advances made by meta-studies, a number of shortcomings remain. These include: (i) arbitrary inclusion or datedness of journal lists; (ii) over-reliance on citation data; (iii) limited coverage in terms of disciplinary focus, number of journals and number of lists included; (iv) inadequate treatment of missing data and unsophisticated imputation methods; (v) treatment of ordinal rank data as metric; and (vi) choice of ranking categories.
In the present study, we elaborate an approach that addresses these shortcomings while combining the strong features of existing studies, extending them and adding novel features. In doing so, we strengthen the methodological underpinnings of the current debate on journal rankings. We (i) extend recent work and offer an aggregate journal ranking based on a comprehensive number of journals, (ii) cover a significant number of disciplines within business and management, and (iii) deploy a unique methodological approach that integrates subjective and objective rankings with a focus on a systematic procedure and the production of comprehensive journal rankings. Specifically, this is the first meta-ranking to feature both the random forests framework (a non-parametric state-of-the-art predictive learning method) for missing data imputation and data envelopment analysis (DEA) (an established non-parametric approach to the performance evaluation of peer entities) for the aggregation of rankings. This paper is decidedly focused on the methodological advancement of existing journal rankings.
Thus, our final aggregate journal ranking outcomes (see Tables 4 and 5) can be seen as a frame of reference for a substantive discussion and objectification of journal rankings, an area which is otherwise rather politicized.
The paper is organized as follows. The next section provides a critical review of objective, subjective and hybrid approaches to journal ranking and rating. Following this, Section 3 provides an overview of the major meta-ranking studies. Subsequently, in Sections 4 to 6, we present our novel meta-approach to journal ranking and rating, discuss its specific methodological advancements and apply it to our data set of journal rankings and ratings.
This involves dealing with issues of database compilation, data missingness and imputation methods, classification trees, random forests and the subjection of the data to DEA. Section 7 concludes with a discussion of main results of our study and their implications. Appendices provide full modeling and computational details.
With particular emphasis on operations research, management science, production and operations management (OR/MS/POM), we apply the method to ascertain the relative positions of journals within the broader business and management discipline, as well as the relative position within the OR/MS/POM field.

Review of objective, subjective and hybrid approaches to journal ranking and rating
With regard to objective ranking, issues arise around the analysis of citation data. The Impact Factor delivered by the Journal Citation Reports [14], defined as the number of cites received in the given year by an average article published in the given journal within the preceding years, is the most widely accepted citation-based measure of the "significance and performance of scientific journals". It is widely acknowledged for its comprehensibility, robustness and availability [15]. Yet, it has received a considerable amount of criticism in the literature, connected to the accuracy problem in collecting citation data, undifferentiated treatment of citations, biases due to the different maturing of published work across different journals, inaccurate definition of citable work and differing citation habits across different sub-disciplines. Further criticism includes bias towards journals with lengthy articles [15, see also 16] and a selective disciplinary and geographical coverage [17,18]. Some of these deficits have recently been addressed by the introduction of a newer, prestige-oriented metric called the Eigenfactor Score [19], which augments the Journal Citation Reports, and by the emergence of Scopus, a citation database by Elsevier which offers broader journal coverage together with the new citation indices SNIP (Source-Normalized Impact per Paper) and SJR (SCImago Journal Rank). These aim to account for discipline-related citation habits and the prestige of the citing journals, respectively [20,21]. Yet, despite these advancements, extensive discussions of the underlying methodological issues raise concerns about sole reliance on citation-based analysis in journal ranking exercises. This is because important work may be considered "common knowledge" and is sometimes left uncited, with acknowledgement given to other work, or citation counts may simply represent fashion and herding within the academic community, implying that citing does not necessarily indicate influence [9,22,23]. There are also problems of selective citations and the opportunity for self- and mutual citations, a poor association between the quality of a journal and that of the individual articles in it, as well as possible subjectivity, which can be pertinent even to analysis based on objective citation data [5,24,25]. Regardless of these shortcomings, the citation impact factor remains an important indicator for assessing journal quality in the academic community.
Subjective, or perceptual, rankings are developed via opinion surveys among experts within an institution, a society, or a research network, and may be motivated by the need to establish a basis for institutional decision making and evaluation, as well as to provide guidance within particular disciplines [1,26,27]. For these reasons, a variety of rankings exist which are tailored to the needs of a particular institution or discipline [10,[26][27][28]. Generally, perceptual rankings alleviate the problems pertinent to citation data, and explicitly capture the perceived quality of journals [5,29]. On the other hand, they are prone to bias in the experts' judgments, due to institutional focus or self-identification with particular journals [11,26]. Furthermore, the coverage of perceptual lists is often restricted to a particular discipline or by institutional preferences [26].
Due to the shortcomings of the above two approaches, the hybrid lists, which in some way combine subjective and/or objective data, have gained attention in the literature [e.g. 13,29,30]. Indeed, pooling data that originates from different sources helps to produce a more balanced view and is seen as a desirable approach [13,27,31]. However, hybrid ranking lists typically have a particular disciplinary or geographical focus; they usually combine a few rankings or ratings and involve hand-collection of perceptual data, and, with a few exceptions, use unsophisticated and less principled techniques for data aggregation [cf. 1].
Overview of journal meta-rankings and ratings

Because objective, subjective and hybrid approaches have attracted the above criticisms, the meta-approach to journal ranking and rating has recently undergone substantial development. It is intended to overcome the drawbacks of the hybrid approaches by relying on a comprehensive selection of existing, in many cases reputable, rankings or ratings, and by aiming to deliver a reproducible outcome. Table 1 offers a compilation of the main journal meta-ranking studies. As can be seen, most of these studies focus on particular sub-disciplines, with the exception of Mingers and Harzing [1] and Halkos and Tzeremes [22], who take a cross-disciplinary approach. The journal coverage ranges from 25 to 229, with the exception of Mingers and Harzing [1], who cover over 800 journals. In terms of rankings used, most of the studies draw on a combination of subjective and objective rankings. Two thirds of the meta-rankings are based on journal rankings contained in Harzing's broadly accepted Journal Quality List (JQL) [32].
The number of underlying rankings is often 10 or less. There is quite a spread in terms of the recentness of the rankings, with only two studies covering recent years. As for data missingness, which arises because of selective coverage of journals, either this is not addressed, or it is not dealt with properly in these meta-rankings (see Section 5.1). For Theußl et al. [33] and Cook et al. [12], data missingness is not an issue. They effectively adopt the perspective that only the observed rank data can determine the ultimate ranking. There are a few, varied, attempts to impute missing data: for example, Bancroft et al. [34] employ a maximum likelihood approach, while Mingers and Harzing [1] implement a form of chained regression.
Insert Table 1 about here

As for the aggregation method for rating/ranking journals, the main approaches used are scoring methods, cluster analysis and consensus ranking via integer programming, with only one study, that of Halkos and Tzeremes [22], featuring the state-of-the-art DEA. While scoring is attractive due to its simplicity, it is rather subjective in its application. Cluster analysis offers a more advanced approach, but usually delivers only a limited set of categories. DEA, in contrast, is a methodologically profound and objective approach that helps to reduce manipulation, over-interpretation and bias. The integer programming approach deployed by Theußl et al. [33] and Cook et al. [12] is very effective at producing a consensus ranking, yet it works within the confines of treating missing data as non-existent. Further, it cannot deliver an interval or ratio scale outcome.

Compiling a database for journal meta-ranking
In view of the limitations and shortcomings of meta-rankings described above, we proceed to develop a comprehensive journal database, which will subsequently be subjected to our rating and ranking exercise.
The primary databases are the journal quality ranking lists contained in the 49th edition of Harzing's Journal Quality List (JQL49) [32].1
• We update and correct a number of the journal lists in JQL49 based on information in the most recent publicly available editions of the respective ranking lists.2
• In order to capture a comprehensive quantity of journals, all journals listed in JQL49, a total of 939 journals, are considered. This provides a broad and cross-disciplinary coverage.
The 10 ranking lists selected for aggregation by means of DEA (Section 6) are labeled 'target lists', as shown in Table 2. In an additional step, these rankings are further augmented by including 2011 Impact Factor data from the Journal Citation Reports [14].3 Thus, we use 11 rankings in total.
Insert Table 2 about here
Most of the journal quality lists rank the journals on an ordinal scale, using differing numbers of scale gradations (ranks) and their designations. Thus, we relabeled the ranks in each of the lists as 1, 2, etc., from highest to lowest. The length of the original scale is maintained in all lists. This overcomes the problems related to adjusting original scale lengths to a common scale length, and the resulting subjectivity/arbitrariness [31].
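By way of illustration, the rank relabeling just described, together with the Impact Factor quintile assignment described in the next paragraph, might look as follows in R; the grade vector and Impact Factor values are toy stand-ins, not actual JQL data.

```r
# Relabel an ordinal list's grades as 1, 2, ... from highest to lowest,
# preserving the original scale length (toy ABS-style grades).
grades <- factor(c("4*", "3", "4", "1", "2"), levels = c("4*", "4", "3", "2", "1"))
rank <- as.integer(grades)   # "4*" -> 1, "4" -> 2, ..., "1" -> 5

# Assign Impact Factor quintiles: 1 = top quintile, ..., 5 = lowest,
# and 6 for journals not indexed (NA), cf. the next paragraph.
impact <- c(3.2, NA, 1.1, 0.4, 2.0, 5.6, NA, 0.9)   # hypothetical 2011 Impact Factors
q <- cut(-impact, quantile(-impact, probs = seq(0, 1, 0.2), na.rm = TRUE),
         labels = 1:5, include.lowest = TRUE)
q <- as.integer(as.character(q))
q[is.na(impact)] <- 6
q
```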
In addition, all journals with an Impact Factor are ranked and divided into quintiles, with 1 denoting the top quintile, 5 the lowest quintile, and a value of 6 being assigned to journals that are not indexed in the 2011 Journal Citation Reports. This procedure helps to alleviate several of the well-known shortcomings of using the Thomson Reuters metric score in analyses [cf. 18], as well as the problems with conventional normalization procedures [26].

1 … journals that it ranked below an A would have been wrongly recorded as missing cases. We also excluded Den 2011 (Danish Ministry Journal List) because it has only two categories (top journal and others) and thus lacks differentiation. We further excluded FNEGE (Fondation Nationale pour l'Enseignement de la Gestion des Entreprises) 2011 because it merely replicates the CNRS (Centre National de la Recherche Scientifique) 2011 ratings for management and business journals. Finally, we excluded AERES (Agence d'évaluation de la recherche et de l'enseignement supérieur) 2012 because it mainly maps CNRS 2011 ratings to a scale with fewer gradations and does not substantially add to the existing data.

2 We have in particular made corrections to the ranking lists ABS 2010, CNRS 2011, UQ 2011 and HEC 2011. These and other adjustments of the JQL can be obtained from the authors on request.

3 We use the two-year average of the 2011 Impact Factor [14]. An alternative would have been to use the five-year average. However, for a number of journals, no five-year average exists; if we had used it, these journals would have received a non-entry despite being included in the citation list. The same rationale applies to the exclusion of alternative measures such as the article influence score.
Resolving the data missingness problem in journal rankings and ratings

Data missingness and imputation approaches
A significant problem pertinent to journal meta-ranking approaches is the considerable amount of missing data. In our database of 939 journals, target ranking lists 1 to 10 (see Table 2) contain 4,770 entries out of the 9,390 possible. This corresponds to an overall missingness rate of nearly 50%. The pattern of missingness varies across journals, and coverage rates range from approximately 28% to 88% across lists. As can be seen from Figure 1, three strategies for dealing with data missingness can be identified in the existing journal ranking studies:

1) Completing the data set. This can be achieved either by the removal of records with missing data, which would however lead to an undesirable loss of information, or by imputation. The latter involves replacing missing entries with artificially generated values (see below).
2) Averaging. For example, Rainer and Miller [35] [37].
Therefore, this paper considers imputation to be the most viable strategy for dealing with missingness in journal lists.
Insert Figure 1 about here

In line with Farhangfar et al. [38] and Gheyas and Smith [39], three approaches to missing data imputation can generally be identified (see Figure 1):
1) Data-driven imputation methods [38]. Missing items are replaced with artificial values, for example the mean, median or mode of the respective variable, or with a random draw from the observed values [39,40]. However, these methods distort the association between variables [40]. In the context of the journal ranking problem, this approach would lead to distortion of the aggregate ranks of individual journals. While this is partly overcome by Benati and Stefani [10], who associate missing rank data with a separate category, their approach is not tailored to offer a rank ordering of journals.
3) For the purposes of our study, we pursue the branch of non- and semi-parametric imputation methods, as these do not (or do not fully) rely on a data model [39,41]. A major advancement within this branch is the group of machine learning approaches [46], on which we draw for our study [see e.g. 39,47].4 In particular, the work by Twala et al. [48] demonstrates the competitiveness of tree-based methods compared to parametric imputation methods in terms of predictive accuracy; see also Hapfelmeier et al. [49]. More specifically, we utilize the random forests method [50], which represents a recent and remarkable advancement in non-parametric classification and regression. This method employs an ensemble of classification or regression trees (see Section 5.2) that predicts the response variable as a committee, while the process of constructing the individual trees in the ensemble involves randomness. This approach results in a prediction accuracy that compares favorably or competitively 'to the best statistical and machine learning methods' [51][52][53]. At the same time, the random forests method is deemed more versatile than conventional statistical methods and can flexibly accommodate a wide range of prediction problems, even those that are 'nonlinear and involve complex interactions' [53], while being acknowledged, among others, for robustness and ease of training as compared to other machine learning methods [52,53].

4 Gheyas and Smith [39] provide an overview of imputation approaches, in particular those featuring neural networks. However, we do not consider this group of methods in our study, preferring instead a methodology which is more straightforward in its application.
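For contrast with the tree-based approach adopted here, the simplest data-driven method from point 1 of the list above amounts to a one-liner; a toy sketch with a hypothetical rank vector:

```r
# Toy illustration of mode imputation, the simplest data-driven method above
# (hypothetical rank vector, not actual list data):
x <- factor(c(1, 2, 2, NA, 3, NA), levels = 1:3)
x[is.na(x)] <- names(which.max(table(x)))  # every missing rank becomes the modal rank '2'
table(x)  # the modal rank is now over-represented, distorting the rank distribution
```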

Classification trees and random forests and their application
Classification and regression trees (CART) represents a well-established and widely used non-parametric predictive learning method [46,52], which has been developed with a strong emphasis on the possibility of missing data among the variables. It seeks to determine the association between the response and predictor variables via recursive, data-driven partitioning of the predictor space, and exhibits a degree of accuracy comparable to the best of the classical statistical methods [54], while producing highly interpretable models and exhibiting other strong advantages [52]. Breiman [50] has advanced CART to produce the random forests framework, which effectively reduces the variability of individual tree predictions by de-correlating and aggregating them across a tree ensemble, offering as a result a remarkably high prediction accuracy and a number of other advantages [52,53]. Random forests are particularly easy to train, basically requiring only a few parameters to be fine-tuned.

Drawing on the random forests framework, we proceed to impute the missing data in each of the target journal ranking lists.5 Imputation in each individual list is based on predictor variables comprised of: (i) journals' subject areas as per JQL49; (ii) the remaining target lists;6 (iii) other journal ranking lists included in JQL49; and (iv) Citation Impact Factors from the Journal Citation Reports (see Table 2 and Table 3). Specifically, we utilize ranking lists from 2001 onwards (see Table 3). Although these are older than the cut-off date for the target lists, and are therefore based on more historical data, their inclusion is warranted to improve imputation accuracy.7

Insert Table 3 about here

The first step in the application of random forests is to (i) pre-impute missing entries in each single predictor.8 This task is necessary as the predictor variables themselves have missing values. While random forests have a built-in mechanism for this step, we use CART to accomplish this task.9 The second step involves (ii) checking the imputation accuracy in the target lists using cross-validation [see e.g. 52]. We find differences in the accuracy of the imputations for different ranking lists. For instance, missing values for Ast 2008 are found to be more difficult to predict than missing values in other lists. Additionally, we perform numeric experiments using different settings for CART and random forests to determine the optimal parameter settings for the imputation engine. The third step is (iii) the actual imputation of missing data in the target lists. Having regard to the misprediction rates in all of the target lists in step (ii), we find that it would be inappropriate to stick to point estimates of the missing rank data; instead, the uncertainty involved must be reflected in the rank predictions.

We therefore adopt, similarly to Zhou et al. [30], a fuzzy rank approach, letting each journal belong to two or more different ranks within the same ranking list, while the respective degrees of rank membership are required to sum to unity (e.g. in ABS 2010, journal X is 60% associated with rank '1' and 40% with rank '2'). A particular advantage of this approach is that our aggregate ranking method (see Section 6 below) accommodates fuzzy rank membership in a natural way.

5 Table 2 exhibits 11 target lists. Imputation has to be carried out in 10 of these.

6 As indicated in Section 4, target ranking list no. 11 is based on 2011 Impact Factor data [14] and features an ordinal rank scale with a few gradations for the purposes of aggregate ranking. When acting as a predictor variable for missing data imputation, this ranking list however maintains the original ratio scale data of the 2011 Impact Factor where available, and indicates a missing value otherwise.

7 Although VHB 2011 and UQ 2011 are included in the primary list, VHB 2003 and UQ 2007 are also used for imputation purposes because they use different methodologies, scoring systems or ranking procedures from the newer versions of the lists [for details and a discussion of the VHB and UQ lists, see 32].

8 All necessary computations have been conducted in the R software environment (version 3.0.0). We have used the CART implementation delivered by the R package rpart (version 4.1-1) and the implementation of the random forests method delivered by the R package randomForest (version 4.6-7).

9 This approach had to be adopted because the randomForest package (see footnote 8) does not allow for missing data when predicting an unknown response. Handling such situations is however an inherent feature of CART (see e.g. Hastie et al. [52, p. 333]). After pre-imputing the missing values in the predictor variables, we add one dummy variable per pre-imputed predictor to indicate whether the respective predictor value is original or has been pre-imputed.
Notably, random forests have a built-in mechanism for estimating individual rank probabilities when making a prediction. We accordingly adopt these probabilities as the respective degrees of rank membership predicted for the given journal in the given ranking list. Random forests have exhibited superior performance in producing such estimates [55]; however, that performance can be further improved by means of calibration techniques. For this purpose we have employed the calibration method suggested by Boström [56], similarly using the Brier score (the mean squared deviation of the predicted rank probabilities from the true ones) as the performance measure, while the calibration data set comprised all test data samples formed in the course of the cross-validations conducted in step (ii). In our experience, calibration has yielded only a marginal improvement of the Brier score, which is in line with Niculescu-Mizil and Caruana [55]. By completing this step we have produced a comprehensive and complete data set, which is then subjected to DEA.
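For concreteness, the Brier score used above can be computed as follows; the membership matrix and true ranks here are hypothetical stand-ins for the cross-validation samples.

```r
# Brier score: mean squared deviation of predicted rank probabilities
# from the true (one-hot) rank indicators, averaged over cases.
memb  <- rbind(c(0.6, 0.4, 0.0),     # predicted membership of case 1 in ranks 1..3
               c(0.1, 0.7, 0.2))     # predicted membership of case 2
truth <- c(1, 3)                     # true ranks of the two cases

onehot <- diag(ncol(memb))[truth, ]  # one-hot encoding of the true ranks
brier <- mean(rowSums((memb - onehot)^2))
brier  # lower is better; calibration should (slightly) reduce this value
```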
Appendix A provides details of the individual steps of the above imputation procedure.

Rating and ranking journals by DEA
DEA [57] represents an established management science approach to the multi-attribute rating of peer entities [58][59][60], in our case journals. A typical DEA setup involves measuring the efficiency of a number of peer entities called decision-making units, or DMUs (e.g. universities), that have a number of common inputs (e.g. budgets, number of staff) and outputs (e.g. research outputs, teaching quality). These inputs and outputs constitute the basis for evaluating the efficiency of the DMUs. There are no a priori weights attached to the inputs and outputs. Instead, DEA offers each DMU the opportunity to cross-evaluate and to apply the input and output weights that most favorably express its own efficiency. Essentially, DEA determines 'frontiers rather than central tendencies' in the data [58,61]. As a non-parametric method, it requires no a priori assumptions about the interaction between the variables in the data set [58].
Conventionally, the DEA methodology is applied to metric data, but it has been extended to cover a variety of settings with ordinal rank data [see 62 for a recent discussion]. In our setting, each journal chooses weights for the individual ranks of each ranking list so as to most favorably express its own performance. Furthermore, the weights chosen by the journal also determine the performance ratings of all other journals from its perspective. Thus, by choosing its own rank weights, each journal explicitly evaluates itself vis-à-vis all other journals. In this way, a cross-evaluation matrix is obtained, from which the ultimate ratings of the individual journals can be derived [64,65].
Due to DEA's advantage of avoiding a priori assumptions and subjective bias, we adopt the above approach to derive an aggregate journal rating and ranking. To this end, we employ the DEA framework for the aggregation of ordinal preferences by Green et al. [64], further extending it to include a rank discrimination threshold in line with Noguchi et al. [65] and a differentiated treatment of individual rankings as in Cook et al. [63]. In addition, we enforce convexity constraints on the rank weights in line with Hashimoto [66]. Further, we use the aggressive form of cross-evaluation [64] to give each journal the opportunity to appear most strongly against its peers, and derive the ultimate journal ratings from the cross-evaluation matrix using arithmetic means, so that all journals have an equal say in determining the final result. Section B.1 in Appendix B provides specific details of the DEA model adopted in our study and of the cross-evaluation method.
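To make the mechanics tangible, here is a deliberately simplified R sketch in the spirit of the ordinal DEA models cited above ([64,65]): each journal picks monotone rank weights, separated by a discrimination threshold, that maximize its own score subject to all scores staying at or below 1. The convexity constraints, differentiated list treatment and cross-evaluation step of the full model (Appendix B) are omitted, and the membership matrix and threshold value are hypothetical.

```r
# Simplified ordinal DEA over fuzzy rank memberships, one LP per journal.
library(lpSolve)

# m[j, r] = degree to which journal j holds rank r (rows sum to 1) -- toy data
m <- rbind(c(1.0, 0.0, 0.0),
           c(0.6, 0.4, 0.0),
           c(0.0, 0.3, 0.7))
J <- nrow(m); R <- ncol(m)
eps <- 0.05  # rank discrimination threshold (hypothetical value)

dea_score <- function(j) {
  obj <- m[j, ]                                   # maximize sum_r w_r * m[j, r]
  A1 <- m; dir1 <- rep("<=", J); rhs1 <- rep(1, J)  # all scores <= 1 under j's weights
  A2 <- matrix(0, R, R)                           # w_r - w_{r+1} >= eps, w_R >= eps
  for (r in 1:(R - 1)) { A2[r, r] <- 1; A2[r, r + 1] <- -1 }
  A2[R, R] <- 1
  sol <- lp("max", obj, rbind(A1, A2), c(dir1, rep(">=", R)), c(rhs1, rep(eps, R)))
  sol$objval
}

sapply(1:J, dea_score)  # each journal's most favorable self-rating
```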
As explained in Section 4, we subject 11 target ranking lists to the above aggregation procedure. The missing rank data in these lists is imputed by means of the random forests method as per Section 5.2 and supplied to DEA in the form of fuzzy membership degrees, expressing the extent to which the respective journal is associated with the individual ranks of the respective ranking list. This represents a distinctive feature of our model as compared to existing DEA approaches to ordinal rank data [62][63][64][65][66]. The random forests method produces fuzzy rank memberships in a natural way, and ordinal DEA can likewise accommodate fuzzy rank data. Thus, the two approaches are complementary for the purposes of aggregate journal rating.
Before proceeding with DEA, we exclude from the final list of journals used in this study those journals with ranks available for less than 25% of the 11 target lists (see Table 2). This reduces the list from 939 to 786 journals, representing around 84% of all journals in JQL49. This approach is taken because the ranks that are available for sparsely ranked journals may not be representative enough, and it also ensures that the imputations are 'pluralistic' enough, rather than being based on just one or two rankings. Our conservative choice of this lower limit of 25% for the number of original rankings per journal is in line with previous related studies, such as Cook et al. [12] and Theuβl et al. [33].10

A particular problem in attaching weights to the individual journal ranks in our DEA exercise is the arbitrary choice of a rank discrimination threshold to separate the weights of any two consecutive ranks [see 62,64,65,67]. If the threshold is virtually '0', this leads to the undesirable suggestion that there may be no difference between any pair of journal ranks. If the threshold value is set to the maximum, this infringes on the spirit of DEA, since it largely restricts the freedom of choice in determining the rank weights [64]. We resolve this dilemma by setting up the process so that the journals settle on an intermediate value of the threshold via Nash bargaining [68].11 Section B.2 in Appendix B provides specific details of the implementation of this procedure. Accordingly, we find the compromise value of the threshold to be 31.3% of the maximal possible value. We then use DEA to rate the journals, producing rating scores in the range from 0.55705 to 1, which yield 729 unique ranks across the 786 journals (the remaining ranks being tied). Table 4 and Table 5 offer a selection of the results.
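A toy sketch of the bargaining step, following the construction detailed in footnote 11 and Section B.2: each journal's utility over the feasible threshold range is a cubic fit to its DEA-based standing at 10 threshold values, and the chosen threshold maximizes the Nash product of utility gains over each journal's minimum utility. All numbers below are randomly generated placeholders, not our results.

```r
# Toy Nash-bargaining selection of the rank discrimination threshold.
set.seed(42)
thresholds <- seq(0, 1, length.out = 10)      # feasible range, normalized to [0, 1]
standing <- matrix(runif(3 * 10), nrow = 3)   # hypothetical standings of 3 journals

# fit a cubic utility function per journal
fits <- apply(standing, 1, function(u) lm(u ~ poly(thresholds, 3)))

grid <- seq(0, 1, by = 0.001)
U <- sapply(fits, predict, newdata = data.frame(thresholds = grid))
U_min <- apply(U, 2, min)                     # minimum-utility (reference) point per journal

# pick the threshold maximizing the Nash product of utility gains
nash <- rowSums(log(pmax(sweep(U, 2, U_min), 1e-12)))  # log-product for stability
grid[which.max(nash)]
```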
We also conduct a series of tests to address the sensitivity of the final rating to the choice of the rank discrimination threshold. We find that the results differ across the entire range of feasible threshold values, with Pearson correlations among the corresponding ratings ranging from 80.4% to 100% and Spearman rank correlations from 79.7% to 100%. At the same time, the rating remains robust in the proximity of the selected threshold value; neither of the above two correlation measures falls below 99.97% within the range of ±10% around the selected threshold value. The final rating exhibits a Pearson correlation of 88.2% and a Spearman rank correlation of 89.9% with the rating produced by means of the Borda count, a points-based system that specifies equidistant weights for the individual ranks in each of the ranking lists.
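For reference, a toy Borda-style computation; the paper does not spell out its exact normalization, so this sketch simply normalizes the equidistant points per list and averages over the lists in which a journal appears.

```r
# Borda-style scoring with equidistant rank weights
# (ranks: journals x lists, 1 = best, NA = not ranked -- toy data).
ranks <- rbind(c(1, 2, 1), c(3, 1, 2), c(2, 3, NA))
K <- apply(ranks, 2, max, na.rm = TRUE)                 # scale length per list (proxy)
pts <- sweep(-ranks, 2, K + 1, "+")                     # rank r -> K - r + 1 points
borda <- rowMeans(sweep(pts, 2, K, "/"), na.rm = TRUE)  # normalize per list, then average
order(borda, decreasing = TRUE)                         # resulting Borda ranking
```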
Insert Table 4 and Table 5 around here

10 We found that the final results remain robust when this lower limit is set to a higher value, e.g. 35%.

11 To be specific, we consider a bargaining problem with n = 786 players [68], where the journals act as players. The utility that a journal attaches to a particular threshold value is taken to be its own standing in the DEA rating that arises under this threshold value. A journal's standing is defined as the difference between this journal's rating score and the average score across the list, normalized to account for the length of the rating scale. The analytic form of each journal's utility function is obtained by fitting a cubic polynomial to 10 equally spaced data points computed for each journal within the feasible range of the threshold. Further, instead of using the disagreement point in the sense of the original Nash bargaining problem, we refer to the minimum utility point [69], where a journal's minimum utility is its lowest possible standing throughout the entire feasible range of the threshold. The bargaining solution is then determined as the threshold value that maximizes the Nash product over the entire feasible range [see also 70].

Conclusion and implications
The debates over the use and abuse of journal rankings are heated and have recently heightened in intensity. Much of the effort in the scholarly exchange regarding these rankings is concerned with the construction and publication of list data. However, fundamental issues related to epistemological positions and their implications for scholarly exchange and the scientific production system [71] are still to be resolved [72]. This paper is sympathetic to these concerns and criticisms in relation to issues such as the homogenization of research cultures, the reduction of pluralism, the skewness of scholarship and the polarization and entrenchment of orthodoxies [2], to mention just a few.
Notwithstanding the importance of the wider, philosophical discourse, the main contribution of this paper is a methodological one, driving the advancement of journal rankings. Our position is that, if journal rankings are here to stay, we had better pursue a rigorous perspective based on state-of-the-art methodologies that transcend individual stakeholder interests in this contested field.
With this paper we provide a meta-ranking that overcomes some of the specific shortcomings of the existing meta-rankings in terms of the construction of the underlying database, the treatment of missing data and the ranking approach. To the best of our knowledge, this is the first study to go beyond previous ranking snapshots and to uniquely feature a combined application of the random forests framework and DEA, two established non-parametric methods, in the construction of the aggregate list. This makes our study wholly non-parametric and therefore free from subjective a priori assumptions about the interaction between the various ranking and rating data included in the study. In this process, we ensure that we retain the strong features of existing and relevant methods, extend them and add novel features (such as fuzzy rank membership and rank discrimination via Nash bargaining) so as to arrive at a 'state-of-the-art' meta-ranking. Confidence in our findings is established through a series of extensive robustness checks, and reliability and cross-validation procedures. However, despite the recency of our methodological approach, future work may still direct its attention towards some possible extensions. For example, it could be explored whether a form of 'discounting' or weighting should be introduced for the imputed journal ranks, due to their omission from the original ranking studies. In our research, they are treated on an equal basis with the existing ranks.

Table 4 offers a selection of the final aggregate journal ranks. We deliberately refrain from making any judgement as to the quality of the various ranks, or the 'star-rating' of certain journals, as is frequently found in other ranking lists. We simply provide a rank-ordering of the journals along with their numerical ratings, leaving stakeholder or user groups to arrive at their own subjective judgments regarding the cut-off points for quality grades. There are also a number of useful applications of this list. It allows the relative standing of a particular journal to be ascertained vis-à-vis all other journals, as well as within its own subject area.
Based on our meta-ranking, Table 5 summarizes the relative standing of OR/MS/POM journals within the broader business and management field (see Tables 4 and 5). On the other hand, OR/MS/POM journals account for only around 7% of both the lower third and the lower quartile of journals in our meta-ranking list.
In addition, our meta-ranking may serve as a reference point onto which the grade and/or star-rating of a particular journal, or the population of journals in other lists (e.g. ABS, VHB, Cranfield), can be mapped (see Table 5). This makes it possible to pinpoint whether there is congruence between the journal grading of those lists and the results of our meta-approach.
Since we have deliberately refrained from attaching grade categories to our journal rankings, the interpretation of such a comparison lies in the eye of the beholder. However, if we were to find gross discrepancies between our meta-ranking and other journal ranking lists, this would not be easy to argue away. Instead, it may serve as an invitation to the authors of the journal list in question to revisit their assessment and ameliorate such discrepancies.
While our list is certainly not a panacea, we introduce a 'dose of objectivity' into some of the issues picked up in the wider debates on journal rankings, such as vested interests, gamesmanship and politicking. To this end, we hope to contribute to shifting the discussion back towards the essence of scholarly endeavours, namely the development of interesting and relevant contributions.

Legend (Table 1):
* Ranking of just a single journal within a single discipline with 285 journals is provided as an illustration
** As of 21 April 2009
† As for the number of journals in the reported ranking
P - perceptual rankings published in academia
OS - opinion survey as source of perceptual data
JQL - cross-disciplinary rankings present in the JQL
OI - other institutional cross-disciplinary rankings
C - citation data or citation-based rankings
U - rankings featuring other usage data (e.g. download counts, citations in syllabi, etc.)
IP - integer programming
DEA - data envelopment analysis
The symbols are in descending order of the respective rankings' share in the data set.

Appendix A: Missing data imputation
As indicated in Sections 4 and 5.2, our data set comprises 939 journals scored in 11 target lists (Table 2) and 14 additional lists (Table 3); let J denote the set of journals and L the set of all ranking lists in Tables 2-3. For convenience, we may interchangeably refer to j ∈ J as cases and to ℓ ∈ L as variables. Let r_{jℓ} represent the score of the j-th journal in the ℓ-th list,12 with r_{jℓ} = M indicating a missing entry, and let I ⊆ L denote the target lists in which missing values are to be imputed and V the full set of variables available as predictors. Imputation of missing values is then conducted by taking each single ℓ ∈ I as the dependent variable (response) and V \ {ℓ} as the independent variables (predictors), and making inference about the missing values in ℓ from the predictor values using the random forests method. As indicated in Section 5.2, this is accomplished in three basic steps, which we describe below in detail. All necessary computations have been conducted in the R software environment [75] (version 3.0.0).

12 The score represents the journal's rank if the respective list ranks journals on an ordinal scale, and its rating if the list rates the journals on an interval or ratio scale.

A.1 Pre-imputation of missing data in predictors
As the predictors in L exhibit missing values of their own, we first pre-impute missing data in every ℓ ∈ L while treating V \ {ℓ} as predictors. Following Hastie et al. [52] (see also [76]), we employ classification and regression trees (CART) to accomplish this task.

A.1.1 Overview of CART
Classification and Regression Trees (CART) [77] represents a widely used non-parametric method of supervised learning, i.e., learning from data how certain input variables (predictors) affect certain output data (response), for the purpose of correctly predicting or estimating the response from the predictors' values [46,52,78]. In this context, prediction of a numerical response (measured on an interval or ratio scale) is termed regression, whereas classification deals with a categorical response (measured on a nominal scale). The CART method is capable of either type of learning and, in addition, has been developed with a strong emphasis on possible data missingness in the predictors.
It exhibits at the same time a degree of accuracy comparable with the best of the classical statistical methods [54] while producing highly interpretable models, without requiring a priori distributional assumptions about the data. Further, it does not require transformations of the predictor variables, allows any mixture of variable types, is resistant to the presence of outliers and irrelevant variables, and is fast to train [52,78]. For these reasons, CART has been adopted in many applied areas and is among the most popular predictive learning methods used in data mining [52,78].
CART produces a data model in the form of a binary tree which is grown in a top-down fashion by recursively partitioning the data. The construction of the tree (tree fitting) starts from the root node, which is associated with all of the observations contained in the data set. A node is split by selecting a particular predictor variable and partitioning its range into two subsets, which respectively define the left and right branches descending from that node; the observations attached to that node are accordingly separated into two groups which become associated with the respective child nodes. The choice of the variable and its partition is made in a way that maximizes the efficiency of the split. For a categorical response, this corresponds to the greatest possible reduction of the heterogeneity of the response among the observations at the node, called the node impurity, for which several different measures are available [52,79]. Specifically, a best split achieves the greatest possible impurity reduction, measured by (weighted) averaging of the impurity over the two child nodes. In simpler words, the goodness of a split is determined by the extent to which discriminating between the predictor values helps to discriminate the response. For a numerical response, the node impurity is the sum of squared deviations of the response from its mean value at that node. The branching of nodes continues either until zero impurity is achieved or until only a few observations arrive at a node. The generated tree can then be used to predict the response from new predictor values: each such case is run down the tree by applying the branching rules generated during tree fitting; the terminal node at which the given case arrives determines the prediction for this case, namely the majority value of the response among those observations which ended up at that node during the construction of a classification tree, and its mean value in the case of regression [52,53,[80][81][82].
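To fix ideas, the Gini impurity and the (weighted-average) impurity reduction of a split can be computed as follows; the response vectors are toy data.

```r
# Gini impurity and the impurity reduction achieved by a candidate split.
gini <- function(y) { p <- table(y) / length(y); 1 - sum(p^2) }

node  <- c("A", "A", "A", "B", "B", "C")   # responses at the parent node
left  <- c("A", "A", "A")                  # responses sent to the left child
right <- c("B", "B", "C")                  # responses sent to the right child

n <- length(node)
gain <- gini(node) - (length(left) / n) * gini(left) - (length(right) / n) * gini(right)
gain  # the best split is the one maximizing this reduction
```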
However, a tree grown to its maximum size may overfit the training data and not generalize well; on the other hand, a tree that is too small may not capture enough of the dependencies in the data [52,53]. Tree pruning is accordingly undertaken to strike a balance between the complexity and predictive capability of the tree, by successively pruning its branches and estimating the resulting predictive accuracy via N-fold cross-validation. The latter proceeds by partitioning the observations in the data set into N approximately equal groups and then successively removing each group from the data set, fitting a new fully grown tree, pruning it to the complexity level in question, and using it to predict the response in the removed observations. Predictions are then compared to the true responses to obtain the cross-validation error over all groups of observations. In this way, a sequence of trees of different complexity (ranging from the fully grown tree to the single-node one) is evaluated, and the complexity level with the smallest cross-validation error is ultimately chosen; alternatively, the smallest tree with an error within 1 standard deviation of the minimum can be chosen [52,[80][81][82][83].
Classification trees further allow the user to specify a prior distribution for the response categories and a misclassification cost matrix to distinguish the severity of wrongly classifying the response categories; these settings take effect in the evaluation of the node impurity, the prediction at a terminal node and the prediction errors. Furthermore, CART implements a mechanism that flexibly accommodates missing data in the predictors. This is accomplished by looking for surrogate variables at every node split: specifically, after a node split has been produced with a particular predictor variable (the primary splitter) and its range partition (the split point), another predictor is sought with a suitable split point that most closely mimics the split achieved with the primary splitter at this node; this defines the 1st surrogate. In the same way the 2nd best surrogate is determined, and so on. Whenever an observation requiring a prediction lacks the value of the primary splitter at a particular node while being run down the tree, the 1st surrogate is utilized to properly send this observation further down; if the value of the 1st surrogate is missing as well, the 2nd surrogate is used, and so on. Hence this mechanism tries to benefit from the correlations within the data to universally allow missingness while effectively compensating for it [52,[80][81][82].13 For the above reasons, CART is suggested to be an ideal choice for the imputation of missing values in the data set [52].
The CART approach to learning has, however, the following drawbacks:
1) predictions are sharply discontinuous across the individual regions of the predictor space due to the recursive partitioning;
2) instability with respect to small variations in the data and, as a result, high variability of predictions;
3) difficulties in capturing additive structures in the association between the response and the predictors;
4) fragmentation of the data, which may cause certain relevant predictors to be disregarded if there are relatively many of them, resulting in a lower accuracy compared to the best available methods; and
5) potential bias in variable selection towards variables with many distinct realisations and those with many missing values [52,53,78,84].
Still, the above advantages of CART, in particular the high interpretability of the tree models and its non-parametric approach [52,53,78], have secured its broad adoption in many applications [46]. A substantial research effort has further been undertaken to address some of the limitations of CART [85,86].

A.1.2 Application of CART
As indicated above, the CART method has been adopted to pre-impute missing values in the variables ℓ ∈ L. We use for this purpose the CART implementation delivered by the R package rpart [80,87], "the de-facto standard in open-source recursive partitioning software" [86].
Note that the variables in L_0 ⊂ L represent journal rankings and are therefore measured on an ordinal scale, whereas the remaining variables L_1 = L \ L_0 rate the journals on an interval or ratio scale.14

Missing values are accordingly imputed for the variables in L_1 by means of regression trees: each ℓ ∈ L_1 is successively treated as the response variable with predictors V \ {ℓ}, and a regression tree is grown on all cases j in the data set for which r_{jℓ} ≠ M. By the same approach, missing values are imputed for the variables in L_0 using classification trees.
Regarding the latter, we employ rpart's default Gini node impurity measure (see also Section A.1.3 below for a further discussion). Following Loh [79], we further specify the cost of misclassifying rank r as rank r' via a loss matrix C with C_{rr'} = |r - r'|, to account for the ordinal nature of the response variable. Furthermore, the higher ranks in ℓ are typically less populated than the middle ranks (e.g. the journals with the highest rank typically represent a small fraction of all journals ranked in ℓ), while their misprediction should be taken more seriously. To account for the underrepresented ranks and thus redistribute the misclassification error between the ranks, we employ case weighting, with the weight attached to cases with rank r taken to be inversely related to the number of cases holding rank r. While growing a tree of either kind, we allow as many surrogate variables at node splits as there are variables in L apart from the response and the primary splitter. Each tree is initially grown to the maximum depth by setting rpart's control parameter cp to 0; splits of nodes with fewer than 10 observations are not attempted. Then, tree pruning is conducted by means of 10-fold cross-validation, where we prefer to stick to the tree with the smallest cross-validation error [84]. Finally, the tree is used to predict the response in all cases where its value is missing.

13 Notably, CART can also implement node splits based on a linear combination of the variables instead of just a single one; in this case, however, the predictors are not allowed to have missing data.

14 We interpret the scale of the journal ranking list BJM 2004 (see entry no. 3 in Table 3) as an interval one.
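Putting these settings together, a pre-imputation step for a single ordinal list might look like the following sketch; the data frame jql, the column name abs2010, the surrogate count and the exact inverse-frequency weight scheme are all illustrative assumptions, since the paper does not print its code.

```r
# Illustrative pre-imputation of one ordinal list with rpart; `jql` and the
# column `abs2010` (a factor of ranks) are hypothetical stand-ins.
library(rpart)

K <- nlevels(jql$abs2010)            # number of rank gradations in the list
loss <- abs(outer(1:K, 1:K, "-"))    # C[r, r'] = |r - r'|: ordinal misclassification cost

obs <- !is.na(jql$abs2010)           # cases with an observed rank
freq <- table(jql$abs2010)           # one plausible inverse-frequency weighting scheme
w <- 1 / as.numeric(freq[as.character(jql$abs2010[obs])])

fit <- rpart(abs2010 ~ ., data = jql[obs, ], weights = w,
             method = "class", parms = list(loss = loss),
             control = rpart.control(cp = 0, minsplit = 10,
                                     maxsurrogate = 23))  # illustrative surrogate count

# prune back to the subtree with the smallest cross-validated (xerror) error
fit <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])

# pre-impute: predict the ranks of the unranked journals
jql$abs2010[!obs] <- predict(fit, newdata = jql[!obs, ], type = "class")
```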

A.1.3 Notes
A more appropriate choice of node impurity measure for classification trees would be one that respects the ordinal nature of the variables in L_0. The original CART monograph introduces two such measures: ordered twoing and symmetric Gini [82,88]. However, neither of them is implemented by the rpart package. This purpose is served by rpartOrdinal, an R package by Kellie J. Archer [89] that implements ordered twoing and an ordinal variant of the Gini impurity measure suggested by Piccarreta [88]. However, this implementation does not accommodate missing data in the predictors [Archer, personal communication]. Further, Twala et al. [48] introduced a novel approach to handling missingness at node splits that has exhibited excellent performance. However, its implementation has not been available to us. Exploring these options represents an interesting opportunity for future work.

A.2 Imputations with random forests and accuracy validation
Having completed the data set by means of CART, we now re-impute those values which were originally missing in the variables ℓ ∈ I. These imputations are accomplished by means of random forests, a recent predictive learning method that delivers, among a number of other strong features, superior predictive accuracy.

A.2.1 Overview of random forests
Random forests [50] represent an ensemble learning method in which a number of classification or regression trees (depending on the task) comprise an ensemble that predicts the response as a committee: by the majority principle in classification tasks, or by averaging the individual predictions of the committee members in regression tasks [52,90]. Tree growing in such an ensemble involves randomization: firstly, the training data for an individual tree represents an equal-sized bootstrap sample of the original data set, obtained by a random draw from the latter with replacement. This approach to building a tree ensemble is known as bootstrap aggregation, or bagging [91]. As individual trees exhibit high variability (cf. Section A.1.1), bagging can remarkably improve on their predictive accuracy by reducing the variance via aggregation of predictions within a tree ensemble [52,91].
Secondly, in addition to bagging, random forests inject further randomness into the process of tree growing by taking only a random selection of the predictor variables into consideration when making a node split. This approach helps to reduce the correlation between the individual trees in the ensemble; provided their prediction strength is not restricted too far by this, a significant improvement in the predictive accuracy of the tree ensemble results [50], making random forests "competitive with the best available methods and superior to most methods in common use" [92], [52,53]. The number of predictors to be selected randomly for node splitting in classification tasks (our primary concern) is recommended to be ⌊√m⌋, where m is the total number of predictors [52,90]; however, the performance of random forests remains quite insensitive to this choice over a wide range of values and can be excellent with a random selection of just 1 or 2 predictors, too [50,52,90]. As random forests benefit from the variability of individual trees, all trees are grown in full and thus require no pruning (we refer the reader to [52,53] for a more detailed discussion of this strategy with regard to the possibility of overfitting).
Apart from having strong predictive accuracy, random forests offer a built-in measure of the prediction error, the out-of-bag (OOB) error estimate, which is computed on-the-fly during the construction of the forest and frees the user from the need to additionally validate the prediction error: since bootstrap sampling leaves each single case out about 36% of the time (the probability of a case not entering a bootstrap sample of size n is (1 - 1/n)^n ≈ e^{-1} ≈ 0.368), the predictions by those trees for which the given case did not enter the bootstrap sample can be aggregated and compared with the true response value, thus producing an estimate of the forest's prediction error rate by averaging over all cases in the data set. As soon as the OOB error rate stabilizes with the growing number of trees, it represents an unbiased estimate of the generalization error [50,52,90].
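As a quick demonstration of the OOB mechanism (on the standard iris data, purely for illustration):

```r
# The OOB estimate in practice: no held-out validation set is required.
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
plot(rf$err.rate[, "OOB"], type = "l",
     xlab = "trees", ylab = "OOB error")   # the error stabilizes as the forest grows
tail(rf$err.rate[, "OOB"], 1)              # stabilized estimate of the generalization error
```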
Furthermore, random forests are robust with respect to noise in the response variable [50] and deliver a number of further advantages: a case proximity measure, outlier detection, clustering, and a novel variable importance measure, among others [92]. At the same time, random forests are particularly fast and easy to train, requiring only a few parameters to be fine-tuned [90], and can effectively deal with a large number of predictor variables, large even when compared to the number of cases in the data set. Being at the same time a non-linear and non-parametric technique, random forests allow application to a wide range of problems, even those that are "nonlinear and involve complex high-order interaction effects" [93]. For the reasons indicated, random forests have gained fast adoption in many areas since their introduction [53,93]. For more recent studies of the variable importance measure, we refer the reader to Strobl et al. [93][94][95] and Hapfelmeier et al. [96].

A.2.2 Application of random forests
As indicated above, we now re-impute the originally missing values in each of the variables ℓ ∈ I by means of random forests. We use for this purpose the implementation of the method delivered by the R package randomForest [90,97]. For each ℓ ∈ I, we utilize the variables in V \ {ℓ} as predictors and, in addition, we introduce one dummy variable per predictor with pre-imputed values, indicating whether the respective predictor value is original or imputed. Hence there are each time altogether 48 variables. We use the following parameters in constructing the forests: number of trees equal to 500, number of randomly selected predictors equal to ⌊√m⌋, and minimum node size equal to 1 observation. These settings are the defaults for randomForest. Furthermore, we utilize class weights to balance the misprediction error between the individual ranks, by the same approach as used in Section A.1.2 for case weights.
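A minimal sketch of this step, continuing the hypothetical objects of the Section A.1.2 sketch (jql_orig, a CART pre-imputed copy jql_pre, and dummies flagging pre-imputed values); the forest's class probabilities directly provide the fuzzy rank memberships used in Section 5.2.

```r
# Illustrative re-imputation of one target list with randomForest;
# all object names are hypothetical stand-ins, not the authors' data.
library(randomForest)

train <- !is.na(jql_orig$abs2010)              # originally observed ranks
X <- cbind(jql_pre[, setdiff(names(jql_pre), "abs2010")], dummies)

cw <- 1 / as.numeric(table(jql_orig$abs2010))  # class weights balancing the ranks
rf <- randomForest(x = X[train, ], y = jql_orig$abs2010[train],
                   ntree = 500, mtry = floor(sqrt(ncol(X))),
                   nodesize = 1, classwt = cw / sum(cw))

# per-rank probabilities for the missing cases = fuzzy rank memberships
memb <- predict(rf, newdata = X[!train, ], type = "prob")
head(rowSums(memb))  # each journal's memberships sum to 1
```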
One of the features of random forests is a novel mechanism for handling missing data in predictors, based on the random forests' case proximity measure [97]. However, the implementation of the method in the randomForest package does not allow for missing data when predicting the response. Mainly for this reason, we stick to the strategy of pre-imputing missing values in the predictors by means of CART as described in Section A.1. We thus cannot rely on the forests' built-in OOB error rate to estimate the accuracy of imputations, and have instead conducted 10-fold cross-validations of the prediction error (see Section A.1.1) delivered by the combination CART + randomForest in each variable ℓ ∈ L. We have repeated these cross-validations 10 times and averaged the misprediction error rates over the trials. A histogram of the resulting error rates exhibits a long bar on the right, indicating that a relatively large fraction of journals consistently cannot be ranked the same as they appear in the respective ranking lists, at least with the data underlying the present study.
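The repeated cross-validation can be sketched as follows (the CART pre-imputation step of Section A.1 is omitted here for brevity; 'dat', 'predictors' and 'target' are the illustrative names from above):

    set.seed(1)
    trials <- replicate(10, {                       # 10 independent trials
      fold <- sample(rep(1:10, length.out = nrow(dat)))
      mean(sapply(1:10, function(f) {               # 10-fold cross-validation
        rf   <- randomForest(x = predictors[fold != f, ],
                             y = dat[[target]][fold != f], ntree = 500)
        pred <- predict(rf, predictors[fold == f, ])
        mean(pred != dat[[target]][fold == f])      # fold misprediction rate
      }))
    })
    mean(trials)                                    # averaged over the trials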

A.2.3 Notes
An alternative random forest method for the imputations would be the conditional inference forests offered by the R package party [98]. These forests are comprised of trees which implement a conditional inference approach to node splitting [86,99] and can accommodate missing data in the predictors by using surrogate variables as in CART (cf. Section A.1.1). Hence, unlike the random forest implementation of the randomForest package, conditional inference forests allow for data missingness when predicting the response. They can therefore be applied to impute missing values in the variables ℓ ∈ L without the need to pre-impute missing data in the predictors, and have performed favorably in a series of tests in [49]. Furthermore, they can naturally treat ordinal response variables [86,98] and offer an unbiased variable importance measure [93-95]. However, the latter issue has not been a concern in our study, and we thus chose the combination CART + randomForest for its better predictive accuracy. For comparison, we have also conducted the cross-validations with conditional inference forests (since missing values need not be pre-imputed in the predictors in this case, we leave the dummy variables out, cf. Section A.2.2, and thus have each time |V| − 1 = 26 predictors in the course of imputations; the default number of predictors to be randomly selected for node-splitting is set in the party package to 5, which coincides with the setting recommended for random forests, cf. Section A.2.1, and we therefore stick to this default; in addition, a maximum possible number of surrogate variables is allowed at node splits, and case weights are assigned as described in Section A.1.2).
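For comparison, a conditional inference forest with the settings discussed above might be fitted as follows; treeresponse() returns per-journal rank probability estimates even when predictor values are missing (names again illustrative, and the surrogate-split setting is left at the package's control defaults here):

    library(party)
    cf <- cforest(rank ~ ., data = dat,
                  controls = cforest_unbiased(ntree = 500, mtry = 5))
    pr <- treeresponse(cf, newdata = newdat)  # list of rank probability vectors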

A.3 Actual imputations
Having tested the accuracy of imputations on the existing rank data, we now turn to actually imputing those values which have been originally missing in the variables ℓ ∈ L. Given, however, the magnitude of the cross-validated error rates reported in Section A.2.2, the accuracy of the predictions to be obtained should be questioned. Random forests predict a categorical response by the majority principle, i.e. by choosing the response category which is being predicted (voted for) by the largest fraction of trees in the ensemble. If the true response is actually known for the predicted case (e.g. when dealing with a test data sample), then the difference between the fraction of correct votes and the largest fraction of votes for any other response category defines the margin of the prediction delivered by the forest [50,100]. A positive margin means a correct prediction; the greater its value, the stronger is the confidence in the prediction. The average value of the margin attained on the test data determines how well the ensemble will generalize, i.e. predict the response variable when its true value is unknown [101].
In the latter case, the difference between the largest and the second-largest fraction of votes in the ensemble can be assumed to serve as the margin of the prediction. While conducting the cross-validations in 10 independent trials as explained in Section A.2.2, we have tracked how this assumed margin is associated with the probability of a correct prediction. Given this prediction uncertainty, it would not be consistent to stick to the point estimates of the missing journal ranks as predicted by the forests; instead, the uncertainty must be reflected in the predicted ranks. We therefore adopt, similarly to [30], a fuzzy rank approach, letting each journal belong to two or more different ranks within the same ranking list. We accordingly define the rank membership as the probability of the given journal belonging to the respective rank. Notably, random forests provide a built-in estimate of this probability as the fraction of trees predicting the respective rank.
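Both the fuzzy rank memberships and the assumed margin are directly available from the vote fractions that randomForest reports; a sketch, with 'rf' a forest fitted as in Section A.2.2 and 'newdat' the journals to be imputed (illustrative names):

    votes  <- predict(rf, newdat, type = "prob")  # journals x ranks, rows sum to 1
    member <- votes                               # fuzzy rank memberships
    top2   <- t(apply(votes, 1, sort, decreasing = TRUE))[, 1:2]
    margin <- top2[, 1] - top2[, 2]               # assumed margin (truth unknown)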
As indicated in Section 5.2, random forests exhibited a superior performance in producing such estimates [55]. The Brier score, defined as the mean squared deviation of the predicted rank probabilities from the true ones, is commonly used as the respective quality measure for predictions given in terms of probability estimates [56]. Specifically, let ℓ ∈ L be a particular target list and J_ℓ represent the subset of journals scoring in the list ℓ. Let R_ℓ denote the number of rank gradations in the ranking list ℓ, and let f_jk be the probability estimate for journal j ∈ J_ℓ to belong to rank k in this ranking list, so that these estimates sum to one over k = 1, ..., R_ℓ (to simplify the presentation, we suppress the subscript ℓ in the notation below where it remains unambiguous). Boström's calibration method no. 1 adjusts these probability estimates by means of a constant calibration parameter p, whose optimal value for the target list ℓ ∈ L is determined as follows. Let k*(j) = arg max_k {f_jk} represent the rank of journal j which has been voted for by the majority of trees in the respective random forest, with ties broken by picking the highest rank. Substituting the calibrated probabilities into the Brier score yields a function P̂(p); differentiating P̂(p) and setting the derivative equal to zero yields the first-order condition for the optimality of p.
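The Brier score as defined above can be computed as follows; 'f' is the journals-by-ranks probability matrix and 'y' the factor of true ranks, with levels ordered as the columns of f (our own helper, not a library function):

    brier <- function(f, y) {
      o <- diag(nlevels(y))[as.integer(y), ]  # one-hot encoding of true ranks
      mean(rowSums((f - o)^2))                # mean squared deviation per journal
    }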

Boström's calibration method no. 2 and its application
This method suggests replacing the constant calibration parameter p with a non-decreasing function of the probability f_j,k*(j). We follow Boström [56] and use a sigmoid function, which we define as p(f) = 1 / (1 + exp(A·f + B)), and adopt the grid search approach of [56] for determining (sub-)optimal values of the parameters A and B. To determine the ranges of parameter values over which the search should be performed, it is instrumental to observe the following properties of p(f_jk):
• a higher absolute value of A leads to a steeper initial increase of the function (cf. panels a and b in Figure A.3), with reasonable values of A being found in the range −100 ≤ A ≤ 0;
• pushing B in the negative direction makes the graph start from a higher ordinate (cf. panels a and c in Figure A.3).
Of the above two calibration methods, the second one offers more flexibility in calibrating the rank probabilities, however at the expense of more intensive computations; furthermore, in contrast to the first method, the second one very likely obtains only a suboptimal solution. The cross-validated results [55] also show that calibration method no. 2 exhibits a slightly better performance (which is never worse than the performance of the first method).
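A sketch of the grid search for method no. 2, under the sigmoid form given above and reusing 'votes', 'y' and the brier() helper from the sketches above; recalibrate() stands for a hypothetical routine applying the calibration with the journal-specific value p in place of the constant parameter, and the search grid is illustrative:

    sigmoid <- function(f, A, B) 1 / (1 + exp(A * f + B))
    best <- list(score = Inf)
    for (A in seq(-100, 0, by = 5)) {
      for (B in seq(-10, 0, by = 0.5)) {           # illustrative range for B
        p  <- sigmoid(apply(votes, 1, max), A, B)  # p as a function of f_j,k*(j)
        fc <- recalibrate(votes, p)                # hypothetical calibration step
        s  <- brier(fc, y)
        if (s < best$score) best <- list(A = A, B = B, score = s)
      }
    }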
We accordingly perform the calibration of the imputed rank probabilities in all of the ranking lists ℓ ∈ L. The actual imputations are then conducted in the list ℓ as follows. We use the data of all journals scoring in the list ℓ as the training data set with the response variable ℓ to construct a random forest as explained in Section A.2.2, and then use this random forest to impute the missing rank data for all journals j ∈ J \ J_ℓ. We then derive the rank probability estimates f_jk for these journals as fractions of trees in the random forest that predict rank k for journal j in the given list. The imputation is completed by calibrating these estimates and adopting them as the journals' fuzzy rank memberships.

The choice of the rank discrimination threshold has received substantial discussion in the literature [see e.g. 62,64,65,67,103,106], as it is likely to severely affect the results produced. However, to the best of our knowledge, no universal and satisfactory solution has been suggested. Specifically, Cook and Kress [103] suggested using the maximum possible value of ε; their approach has however been invalidated by Green et al. [64] as infringing on a fundamental principle of DEA. They have in turn suggested using ε = 0, which has been criticized by Noguchi et al. [65] as contradicting the basic purpose of ranking, an argument which we share as well (see also [67]). As a remedy for this problem, Noguchi et al. [65] have suggested their own formula for calculating the value of ε, which has however been criticized for its arbitrariness (see [107]). We share this criticism and adopt in the present work a novel game-theoretic approach to determining the value of ε, which is presented in detail in Section B.2 below.
Note that model (B.1)-(B.5) represents a further departure from the approach adopted in [64-66,103] in that the latter operates with maximization, which we however find less suitable for the purposes of journal ranking (the reader is referred to [70] for an overview of these and other possible formulations of the secondary goal discussed in the literature). The aggregate rating scores A_i accordingly comprise the ultimate rating list of journals i ∈ J and further determine their aggregate ranking; Section 6 provides further details on both.

Notes
Cross-evaluation is deemed a powerful extension of DEA and has attracted much interest in research and application, being praised for its ability to rank-order the subjects (in our case journals), a capability not offered by DEA per se [70,109,110]. Several different approaches to cross-evaluation have been discussed in the literature. In particular, an alternative approach suggested by Green and co-authors questions the commonly adopted averaging of the rating scores with regard to its acceptability from the individual subjects' perspective. As a remedy for this issue, they have suggested a game-theoretic approach to averaging the rating scores, with weights determined via the subjects' Shapley value in a coalitional game. Adopting their approach would, however, render the computations in our setting intractable; we therefore maintain the commonly adopted aggregation approach as per (B.7), while letting the journals determine the final outcome in a cooperative fashion by choosing the rank discrimination threshold via n-person Nash bargaining, as explained in Section B.2 below. Furthermore, Wu et al. [110] pursued maximization of the subject's ranking position as the secondary goal in cross-evaluations. While likely being computationally prohibitive in our setting, their approach represents an interesting opportunity for future research. Our approach to obtaining the rank discrimination threshold in Section B.2 is related to theirs in that we express a journal's utility in the bargaining game via its relative standing in the rating list. We refer the reader to [70] for a recent overview of other existing approaches to cross-evaluation, which we do not treat here in greater detail.
As a final note, we have conducted all computations in this appendix using MATLAB (version 7.11.0) and its Optimization Toolbox (version 5.1).

B.2 Determining the rank discrimination threshold
As discussed in Section B.1 above, the problem of determining a proper value of the rank discrimination threshold (denoted by ε in model (B.1)-(B.5)) has received substantial discussion in the related literature; however, none of the suggested approaches has proven suitable for the purposes of the present work (cf. Section B.1). The analysis of aggregate rating scores in our setting has revealed that the choice of the rank discrimination threshold affects different journals differently; in particular, increasing the value of ε improves the relative standing of some journals in the resulting rating list while worsening that of others. Hence, setting the value of ε exogenously would inevitably lead to arbitrariness in the results produced. To avoid such arbitrariness, we propose a novel game-theoretic approach to choosing the rank discrimination threshold that lets the journals in J jointly determine the value of ε via bargaining. We model the respective bargaining situation as an n-person Nash bargaining problem [68] and determine its outcome as follows (in a similar fashion, Wu et al. [112] consider a cooperative game approach in which the subjects, in our case journals, seek to determine a set of common weights, in our case rank weights, via n-person Nash bargaining, and utilize the weights so obtained to calculate each subject's rating score). In the first step, we obtain the maximum value ε_max which the rank discrimination threshold can feasibly attain; in our setting, ε_max amounts to approximately 0.01961.
In the next step, we obtain the utility functions u_j(ε), which represent the utility that journal j ∈ J extracts from a particular value ε ∈ [0, ε_max]. We express this utility via the journal's standing in the aggregate rating list produced under the given value of ε: in simpler words, a journal's standing represents its position relative to the average journal on the list, normalized by the length of the rating scale. Note that the normalization is necessary since different values of ε lead to different lengths of the rating scale.
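The original formula for the standing is not preserved here; one consistent reading of the verbal definition above, with A_j(ε) denoting the aggregate rating score of journal j at threshold ε and n = |J|, would be

    u_j(ε) = ( A_j(ε) − (1/n) Σ_{i∈J} A_i(ε) ) / ( max_{i∈J} A_i(ε) − min_{i∈J} A_i(ε) ),

where the denominator serves as the length of the rating scale under the given ε.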
We then repeatedly produce the aggregate rating list at each of 10 equally spaced values of ε ranging from 0 to ε_max. This yields, for each j ∈ J, a series of 10 data points that represent the utility function u_j(ε), to which we fit a polynomial approximation û_j(ε). The relative approximation error remains low for the vast majority of our 786 journals and has a maximum value of 3.53%. In all of the latter cases, however, the respective absolute error is negligibly low (below 0.03). We adopt for these reasons the polynomial functions û_j(ε) as an excellent representation of the utility functions of the respective journals j ∈ J.
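The polynomial approximation of one journal's utility function can be sketched as follows; the polynomial degree (3 here) is an assumption, as the text does not fix it, and 'u' holds the journal's 10 sampled utility values:

    eps <- seq(0, eps_max, length.out = 10)   # eps_max ~ 0.01961 in our setting
    fit <- lm(u ~ poly(eps, 3, raw = TRUE))   # fit a cubic to the 10 data points
    u_hat   <- function(e) predict(fit, newdata = data.frame(eps = e))
    rel_err <- abs(fitted(fit) - u) / abs(u)  # relative approximation error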
In the next step, we describe the bargaining situation in terms of an n-person Nash bargaining problem. Note that the bargaining set U is comprised of all n-dimensional utility vectors induced by the feasible values of the rank discrimination threshold, whereas the disagreement point is the vector whose elements represent the minimum utilities possible for the respective journals [cf. 69,114]. Note further that the bargaining set is by construction connected and closed, while being at the same time non-convex. Replacing it with its convex hull, as in classical Nash bargaining [68], does not however prove to be a satisfactory approach in our setting, because the nature of the given bargaining game does not allow its players (i.e., the journals j ∈ J) to treat a lottery over U, or, equivalently, a randomized choice of ε, as a viable bargaining outcome [cf. 115]. In simpler words, the journals cannot be assumed to be expected-utility maximizers in the given bargaining game; instead, they maximize their utilities û_j(ε) directly.
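Although our actual computations were carried out in MATLAB (cf. the note in Section B.1), the resulting bargaining outcome can be sketched in R as follows: with the fitted functions û_j collected in an (assumed) list 'u_hat', the Nash solution maximizes the product of utility gains over the disagreement point d_j = min_ε û_j(ε); since the polynomials may make this product multimodal, a fine grid is used rather than a local optimizer.

    grid <- seq(0, eps_max, length.out = 1000)
    U    <- sapply(u_hat, function(u) u(grid))  # 1000 x n matrix of utilities
    d    <- apply(U, 2, min)                    # disagreement point, per journal
    gain <- sweep(U, 2, d)                      # utility gains u_j(eps) - d_j
    nash <- rowSums(log(pmax(gain, 1e-12)))     # log of the Nash product
    eps_star <- grid[which.max(nash)]           # the bargained threshold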