Global Distribution of Google Scholar Citations: A Size-independent Institution-based Analysis

Most currently available schemes for performance-based ranking of universities or research organizations, such as Quacquarelli Symonds (QS), Times Higher Education (THE) and the Shanghai-based Academic Ranking of World Universities (ARWU), use a variety of criteria that include productivity, citations, awards and reputation, while Leiden and Scimago use only bibliometric indicators. Research performance evaluation in the aforesaid cases is based on bibliometric data from Web of Science or Scopus, which are commercially available priced databases whose coverage includes peer-reviewed journals and conference proceedings. Google Scholar (GS), on the other hand, provides a free and open alternative for obtaining citations of papers available on the net, though it is not clear exactly which journals are covered. Citations are collected automatically from the net and are also added to self-created individual author profiles under Google Scholar Citations (GSC). This data was used by the Cybermetrics Lab, Spain, to create a ranked list of 4000+ institutions in 2016, based on citations from only the top 10 individual GSC profiles in each organization (the top profile is excluded for reasons explained in the text; the simple selection procedure makes the ranked list size-independent, as claimed by the Cybermetrics Lab). Using this data (Transparent Ranking, TR, 2016), we find the regional and country-wise distribution of GS-TR citations. The size-independent ranked list is subdivided into deciles of 400 institutions each, and the number of institutions and citations of each country is obtained for each decile. We also test for correlation between the institutional ranks assigned by GS-TR and the other ranking schemes for the top 20 institutions.


INTRODUCTION
The Transparent Ranking [1] by the Cybermetrics Lab, beta version, was produced with the objective of validating the use of Google Scholar Citations as a basis for a performance-based comparator list of research institutions and universities. It offers the possibility of open performance ranking and indicators without recourse to data that is behind paywalls, such as Web of Science or Scopus. Google Scholar Citations lists self-created profiles of individuals which display an individual's papers and their citations, machine-updated by Google Scholar. Google Scholar does not state how many original journal sources it uses, and additionally skims information off the net. As a result, GS may reflect non-peer-reviewed papers, both as source items and as citing papers. This has two aspects. One relates to the 'quality', or lack thereof, of non-peer-reviewed papers. For a long time now, the judgment of peers who are experts in the subject has been accepted as conferring acceptability on a scholarly work; this is augmented by having more than one peer review (two or three) to lend greater credence to the process of validation. Lately, however, more scientists are publishing on nonstandard platforms, such as online journals and open archives like arXiv or SSRN. To restrict evaluation to only peer-reviewed journals would mean that all these additional papers would be missed. Finally, peer-reviewed journals are often subscription-based (unless Open Access) and therefore not accessible to a wide section of people. The use of the open format for Google Scholar based ranking of institutions implies that anyone can download the data and verify the calculations, whereas indicators which are not transparent and are based on priced databases are difficult to verify.
Google Scholar has been widely written about and compared to other ranking schemes. [2][3][4][5][6][7][8][9][10][11] However, the new size-independent Google Scholar Transparent Ranking (GS-TR) has not been derived directly from Google Scholar, but from Google Scholar Citations (GSC). [12] With GSC, Google rolled out a major enhancement in 2012: the possibility for individual scholars to create personal "Scholar Citation profiles". These public author profiles are editable by the authors themselves. [13] Individuals, logging in through a Google account with a bona fide address usually linked to an academic institution, can now create their own page giving their fields of interest and citations. Google Scholar automatically calculates and displays the individual's total citation count, h-index and i10-index. According to Google, "three-quarters of Scholar search results pages show links to the authors' public profiles" as of August 2014. [14,15] Prathap [16] has shown from the same data that the greater the scientific wealth of a nation, the more it is concentrated in a few premier institutions.
In the following sections we first describe the data and then examine how institutions of different countries are distributed in terms of citations, as computed according to the Transparent Ranking's 'size-independent' methodology (GS-TR). Secondly, we obtain the rank correlation between GS-TR ranks and those of other popular ranking schemes. Results are presented, followed by a section of discussion.

Data
Data was taken from a list of institutions ranked by Google Scholar Citations that appeared on the Webometrics website in July 2016. [1] The Transparent Ranking, or Ranking Web of Universities, was created by the Cybermetrics Lab (Spanish National Research Council) with the objective of testing the validity of using Google Scholar Citations (GSC) to rank universities and research institutions. It started in a small way in 2004 and has updated the rankings every six months, providing information about the performance of institutes all over the world. Currently, it is in an experimental (beta) state. [16][17][18][19][20][21] The methodology of the TR ranking uses individual author profiles on Google Scholar Citations which have standardized (official) institution names and an official e-mail address. GSC automatically updates the citations in the individual profiles. There are at present about 1 million profiles from 5000 institutions in Google Scholar Citations. [1] This data is nonetheless incomplete, since people are required to voluntarily create their profiles on GSC, which implies that not every scientist/author has a profile. They are also required to make their profile public for it to be included in the evaluation procedure. According to the Cybermetrics Lab, the data is large enough to give a representative ranking of world universities in spite of the incompleteness.
GS-TR uses an unusual procedure in its evaluation: not all the profiles in GSC are used. To make the ranking size-independent, only the top ten profiles from an institution are selected, the first is excluded, and the citations are aggregated over the remaining nine profiles to obtain the institutional citation count. (The topmost profile is eliminated to exclude possible outliers.) The citation aggregates are then used to rank the institutions. In other words, GS-TR focuses on a sample of elite scientists to arrive at the ranking. It is similar in spirit to the Nature Index, [22] which also focuses on an elite set of journals (and thereby elite scientists). At the same time, since it takes the same number of profiles from each institute, large or small, the resulting ranking is size-independent.
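The aggregation step described above can be sketched in a few lines. This is a minimal illustration with invented profile data, not the Cybermetrics Lab's actual code; `tr_institutional_citations` is our own hypothetical helper name.

```python
# Sketch of the GS-TR aggregation step (hypothetical data).
# For each institution: sort the profile citation counts, drop the topmost
# profile (outlier control), and sum the next nine to get the institutional score.

def tr_institutional_citations(profile_citations, keep=9):
    """Aggregate citations over the top `keep` profiles after dropping the first."""
    top = sorted(profile_citations, reverse=True)
    return sum(top[1:1 + keep])

# Invented profile citation counts for two fictitious institutions.
profiles = {
    "Univ A": [50000, 12000, 9000, 8500, 7000, 6500, 6000, 5800, 5500, 5200, 300],
    "Univ B": [15000, 14000, 13500, 13000, 12500, 12000, 11500, 11000, 10500, 10000],
}

# Rank institutions by their aggregated (size-independent) citation counts.
ranking = sorted(profiles, key=lambda u: tr_institutional_citations(profiles[u]),
                 reverse=True)
```

Note how the single star profile of "Univ A" (50,000 citations) is discarded, so "Univ B", with a deeper bench of well-cited profiles, ranks first; this is the outlier control the TR procedure aims at.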
According to TR, steps are taken to deal with problems such as the presence of duplicate profiles. [1] In the latest edition, the Ranking Web of Universities contains 4132 institutions with an aggregate of 174,915,125 citations, making it one of the largest ranking exercises in the world. Harvard University has the highest rank, with 1,389,765 citations. The country where each institution is located is also listed, making it possible to obtain country-level indicators.

Objective
Our objective in this paper is to gauge national citation performance based on the upper echelons of individual GSC profiles in an institution. Since the exercise by the Cybermetrics Lab is in beta, our results are subject to the same caveats, i.e., uncertainties introduced by the incompleteness of the data and the sampling technique described above. It is not clear for what period the citations have been collected, but presumably from the start of GSC to July 2016. The validity of the GS-TR list is tested by rank correlation with other currently accepted ranking schemes, using the top 20 institutions (Table 1).

METHODOLOGY
We have divided the ranked list into deciles of 400 institutions each, with the few remaining institutions grouped as 'Extra'. The top decile contains the 400 best-performing institutions, with the highest citations. The country labels indicate how these top institutions are distributed among different countries. The 400 institutions in the second decile are a notch below those in the top decile, and so on, progressively declining in quality till the 10th decile. The basic characteristics of the ranking schemes mentioned here are shown in Table 2. We then:
1) examined the distribution of citations and institutions across countries in each of the deciles;
2) tested for rank correlation between the institutional ranks assigned by GS-TR and other current ranking schemes.
(The Google Scholar ranking is also referred to as the Ranking Web.) The top 20 institutions (ranked by Webometrics, Table 1) were taken, and their corresponding ranks from QS, Leiden, Scimago, THE and ARWU were collected and tabulated (Table 1).
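The rank-correlation test in step 2 amounts to computing Spearman's rho between two rank lists over the same institutions. A minimal sketch, with invented rank data, using the standard formula for untied ranks (rho = 1 − 6·Σd² / (n(n² − 1))):

```python
# Spearman rank correlation between two rankings of the same institutions.
# Valid for untied ranks; ties would require the general Pearson-on-ranks form.

def spearman_rho(ranks_a, ranks_b):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), d_i = rank difference."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

gs_tr_ranks = [1, 2, 3, 4, 5]   # GS-TR ranks (illustrative, not real data)
other_ranks = [2, 1, 4, 3, 5]   # another scheme's ranks (illustrative)
rho = spearman_rho(gs_tr_ranks, other_ranks)
```

In practice a library routine such as `scipy.stats.spearmanr` would also handle ties and report a p-value; the hand-rolled version above is just to make the computation transparent.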

RESULTS

Citation Statistics
Citation statistics for different deciles are given in Table 3.
Average citations per institute range from 272,923.6 in the top decile to 80.9 in the 'Extra' group (Table 3).
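Per-decile averages of this kind can be computed directly from the ranked citation list. The sketch below uses synthetic data (a simple descending sequence standing in for the 4132 real citation totals); `decile_averages` is our own illustrative helper.

```python
# Compute mean citations per institution for each decile of 400,
# grouping any remainder beyond 10 deciles as 'Extra' (as in Table 3).

def decile_averages(citations_ranked, bin_size=400, n_bins=10):
    """citations_ranked: citation totals sorted in descending order."""
    averages = {}
    for d in range(n_bins):
        chunk = citations_ranked[d * bin_size:(d + 1) * bin_size]
        if chunk:
            averages[str(d + 1)] = sum(chunk) / len(chunk)
    extra = citations_ranked[n_bins * bin_size:]
    if extra:
        averages["Extra"] = sum(extra) / len(extra)
    return averages

# Synthetic stand-in for the real ranked list of 4132 institutions.
citations = list(range(4132, 0, -1))
averages = decile_averages(citations)
```

With real data the dictionary would reproduce the Table 3 column of per-decile averages (272,923.6 for decile 1, down to 80.9 for 'Extra').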

Regional Distribution of GS-TR Citations and Institutions
Aggregating the citations and numbers of institutions for different countries into regions, we obtain the regional distribution of citations (Figure 3). North America, Europe and Africa have more than 10 million GS citations each. Oceania, South America and Asia have between 1 million and 10 million GS citations each. The largest number of institutions is from Asia and the smallest from Oceania (<100).

Distribution of a country's institutes in different deciles
The distribution of citations in different deciles is seen in Figures 4a and 4b. For every country, the number of institutions in the 1st, 2nd, 3rd, etc. deciles is seen in the column graph. A high-performing country will have relatively more institutions in the top deciles. The USA has the largest number of institutions and citations in each decile.
In Figure 4a we see countries arranged in order of number of institutions (continued in Figure 4b). Each colour in a column stands for a decile (see legend). The USA has overall 873 institutions in Google Scholar Citations (minimum citation 20), distributed across all the deciles. Countries with institutions in the first decile get their highest contribution to citations from them. The average citation count of an institute in the first decile is 272,923.6, falling to 82,674.5 in the 2nd decile, 37,094.4 in the 3rd, and so on (Table 3).
The USA has the largest share of institutions (21.13%), followed by India (7.94%). Countries which have blank sections in the lowest part of a column in Figure 4a do not have any institutions in the first decile (e.g., Turkey, Iran, Poland and Malaysia have no institutions in deciles 1 and 2; Colombia, Thailand, Argentina and Pakistan have no institutions in the first three deciles).

Rank Correlation
From Table 4, the rank correlation appears high for THE, QS and ARWU. THE and QS also take into account perceptual variables such as reputation, and ARWU includes awards and prizes in the rank computation. In contrast, Leiden and Scimago take scientometric indicators such as total papers and citations into account; these are found to be moderately correlated with GS-TR, which is size-independent and based on top-level citations.
The major advantage of using GSC is that it is freely available and accessible to anyone with an internet-enabled computer. Results obtained can be duplicated and verified. There is only a single indicator, citations, and there are no complicated calculations. The use of only the top profiles in an institution means that most of the information on citations is discarded and only the top edge is retained to generate a rank order. How does the Google Scholar ranking compare with other schemes? Rank correlation shows that correlation is low with THE and QS. This may have been expected, as they are partially survey-based and partially based on bibliometric indicators. The correlation with Scimago and the Leiden Ranking, which are based only on bibliometric indicators, is better. In fact, correlation is best with the Leiden Ranking (~0.7), which also has a size-independent variant.
We suggest that the procedure adopted by Webometrics in selecting their samples puts the focus on the 'cream' in each institution. This is also the segment with which academic 'reputation' is associated. However, unlike the Nobel Prize, it judges contemporary excellence rather than historical reputation.
One of the criticisms faced by bibliometric evaluations of universities is that they emphasize research over teaching. Undoubtedly teaching is an important component of universities' responsibilities, but it is much harder to quantify. Good research, on the other hand, may also reflect good teaching, as the two feed back into each other through students. The other criticism relates to the evaluation process. Peer review has been taken as the gold standard of evaluation, but in the case of world rankings it would clearly be impractical for various reasons: the volume of information to be processed, the time required and the cost involved. Moreover, reviewers have expertise in small domains and cannot be expected to judge entire universities, especially when these are distributed all over the world. [22]

DISCUSSION AND CONCLUSION
Earlier performance-based global rankings of universities, or of other institutions whose members published academic or research papers, used diverse parameters, including survey responses by experts, quality of teaching, and internationalization of student and staff bodies, in addition to publication-based performance evaluation in terms of numbers of papers and citations (e.g., Times Higher Education (THE) and Quacquarelli Symonds (QS)); in these early exercises data was gathered from the universities themselves to conduct the evaluation. ARWU, the Academic Ranking of World Universities from Jiao Tong University in Shanghai, was the first ranking where data was taken from a non-partisan source, the Web of Science, which records bibliographic details of journal articles and their citations. The ARWU gave weights to Nobel prize winners and Fields medal holders to rate the universities, in addition to the usual bibliometric indicators (papers, citations). The Leiden Ranking uses only bibliometric indicators (citation and collaboration) and ranks institutions in the first percentile (top 1% by citation) and decile (top 10%); it has both size-independent and size-dependent rankings. (A size-independent ranking can compare two institutions of different sizes.) Other rankings, such as SCIMAGO from Spain, use Scopus, which has a much higher coverage of journals than WoS, for their data. Both Web of Science and Scopus have now introduced proceedings papers together with journal literature. The URAP from Turkey and the NTU ranking from Taiwan are other recent entrants on the ranking scene. The Webometrics Ranking differs from the others in several ways. [23][24][25][26][27][28] It does not take its data from a citation index like WoS or Scopus, but directly from Google Scholar Citations on the web. Moreover, instead of cumulating over all individual citation profiles for an institute, it selects just nine author profiles and aggregates over these to obtain a representative citation count for the institute. One can conclude that the Webometrics methodology has an advantage in that it processes only a very small fraction of the full data; it is therefore economical in terms of data handling, time and cost.
It captures something analogous to reputation or academic excellence by considering a few top-ranking members using bibliometric indicators rather than surveys. Further, it is size-independent, a property that is very useful as the sizes of universities are highly skewed; otherwise the ranking is dominated by the large established institutes, and smaller institutes pass under the radar. Although Leiden and Scimago also use bibliometric indicators (papers and citations), they capture the overall citations of an institute and are seen to differ considerably from Webometrics ranks in Table 3, even in the case of large and prestigious universities like Princeton and Carnegie Mellon. Finally, there is a movement toward open data in the academic world, which Google Scholar satisfies well.
Some of the disadvantages are the many errors that creep in while automatically collating data on the net, the question of where to stop when taking non-peer-reviewed literature into account, and whether Webometrics will be able to win the confidence of academics, managers and administrators the world over.

ACKNOWLEDGEMENT
Aparna Basu thanks South Asian University for financial assistance as Guest Faculty.