A Comparative Analysis of Unified Informetrics with Scopus and Web of Science

Numerous bibliographic databases track the number of publications, citations, and h-index to record an individual's scholarly progress. However, journal coverage varies among these databases; hence they produce different publication, citation, and h-index counts for the same author. A substantial literature compares such bibliographic databases and documents the differences between the bibliometrics they generate, but none of it offers a comprehensive or complete solution to reconcile those differences. At present, there is no common platform that can provide a single count of publications, citations, and h-index across multiple bibliographic databases. To overcome this limitation, we propose a new method in academic research publication to calculate weighted unified (single) informetrics for an author. With the proposed solution, one can view a single article, citation, and h-index count computed from multiple indexing databases. In this study, data from Scopus and Web of Science is used to generate a new database named "conflate". Further, a comparative analysis of the proposed model against Scopus and Web of Science is performed at three levels: author, organization, and journal. The proposed model can serve as a new single indicator of the research influence of an author, organization, or journal.


INTRODUCTION
Scientific contributions work as a driving force for the continuous growth of science and society. [1] To measure the impact of such contributions, citations provide a quantitative evaluation that helps describe publication patterns, research quantity, quality, and the influence of authors. [2,3] Citation data is often affected by the coverage of bibliographic databases such as Scopus, Web of Science (WoS), and Google Scholar, because each collects only the citations received by the publications it indexes. [4,5] Hence, one can observe different bibliometrics, such as the number of publications, number of citations, and h-index, for the same author in different bibliographic databases.
The prolific growth of bibliographic databases has created new opportunities. [6] Bibliographic databases are used worldwide to produce comparative statistics for bibliometricians. [7,8] Various authors have compared Scopus, WoS, PubMed, Google Scholar, etc., based on the availability of digital object identifiers (DOIs), [9] bibliometric analysis, [10,11] coverage, [12] features and citation properties, [13][14][15][16] strengths and weaknesses, [17] content comprehensiveness and searching capabilities, [18][19][20] longitudinal and cross-disciplinary comparison of coverage, [21,22] language coverage, [23,24] use in academic papers, [25,26] systematic comparison of citations based on subject categories, [27] journal coverage, [28][29][30] retroactive growth comparison of universities, [31,32] and h-index of authors [33,34] and countries. [35] The comparative statistics provided by these authors are used by funding agencies, government bodies, promotion committees, ranking agencies, accreditation agencies, and other stakeholders to measure the quality and impact of authors. Hence, bibliometric analysis has emerged as a powerful tool and an impartial system for its stakeholders.

Table 1. Comparative studies of bibliographic databases: key ideas, limitations identified, and whether a solution was proposed.

| Key idea or concept | Limitations identified by authors | Solution proposed? | Ref (year) |
|---|---|---|---|
| To compare the major features of Web of Science, Scopus, and Google Scholar as citation databases. | Traditional indexing databases lack proper subject indexing; hence there is a need for an idea that serves as a solution to citation-based searching for quantitative evaluations. | × | [1] (2005) |
| To compare strengths and weaknesses of PubMed, Scopus, Web of Science, and Google Scholar. | Databases are compared in the context of their content and various practical aspects; no such limitations are mentioned. | × | [17] (2008) |
| To compare h-indices of highly cited researchers of Israel based on their citation counts in Web of Science, Scopus, and Google Scholar. | Disciplinary differences in coverage and differences in citation counts are observed across databases. | × | [33] (2008) |
| To gauge the comparability of determining the h-index from Scopus and Web of Science for 10 universities. | Significant differences are observed in the content and cited references of both databases. | × | [35] (2009) |
| To compare two instruments, Scopus and Web of Science, for a typical university in Portugal. | Different abstracting policies and apparent errors in constructing the databases are identified. | × | [31] (2009) |
| Three citation databases are compared with reference to the book Introduction to Informetrics. | Findings reveal that citations across databases are clearly comparable, but there is no single citation database that can supplant the others. | × | [14] (2010) |
| Three citation resources are compared to find the one with the most representative citation coverage. | Results show variation in the retrieved data in terms of citation counts. | × | [20] (2013) |
| Journal coverage across Scopus and Web of Science is described. | Results indicate that use of either database for research evaluation may be biased; hence both should be used with caution. | × | [28] (2016) |
| Systematic and comprehensive comparison of coverage across Scopus, Web of Science, and Google Scholar. | All three databases provide sufficient coverage for cross-disciplinary comparisons, but results show that specific metrics change the conclusions across databases. | × | [21] (2016) |
| The availability of DOIs in publication items in Scopus and Web of Science is examined. | Both databases lack 100% availability of DOIs; hence authors are encouraged toward DOI-based establishments. | × | [9] (2016) |
| Web of Science publication data for Indian central universities (1990-2014) is studied for ranking and policy purposes. | The study introduces a quality-quantity composite index for central universities in India; a generalized model using variables and their weights is also proposed as an optimal ranking solution. | Partial | [45] (2016) |
| Google Scholar, Web of Science, and Scopus are compared based on 252 subject categories. | The study provides evidence that Google Scholar has more citations than Scopus and Web of Science; Google Scholar may be seen as a superset of both. | Partial | [27] (2018) |
| Bibliometrics based on highly cited documents in Scopus, Web of Science, and Google Scholar are explored. | The study demonstrates that these databases miss a significant amount of information when compared on counts of highly cited publications. | × | [12] (2018) |
| Publications of Jordanian authors are studied in literature databases such as Scopus, Web of Science, and PubMed. | Results show that Scopus, Web of Science, PubMed, etc., differ in their coverage, focus, and tools. | × | [11] (2019) |
| Web of Science and Scopus are compared based on their language coverage of publications. | Results obtained at the document level (languages and key areas) differ from journal-level analysis for both Scopus and Web of Science. | × | [24] (2019) |
| Retroactive growth, correlation, and coverage of universities are validated based on Scopus, Web of Science, and Google Scholar. | Institutional productivity varies across Google Scholar, Scopus, and Web of Science in terms of total number of publications. | × | [32] (2019) |
| A comparative, dynamic, and empirical study is presented based on academic papers available in Scopus and Web of Science. | A deeper analysis based on the content of Scopus and Web of Science requires further investigation. | × | [26] (2020) |
| Citation coverage is presented for Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations COCI. | Results reveal that no single database is adequate as a bibliographic database; future studies may reveal which data source best suits stakeholders' needs. | Partial | [22] (2021) |
| Comparative analysis of journal coverage for Scopus, Web of Science, and Dimensions. | Results indicate that the databases have significantly different journal coverage. | × | [30] (2021) |
| Author ranking based on Scopus and Web of Science is introduced with an improvement to the h-index. | Results reveal a significant difference between h-index values calculated in Scopus and Web of Science. | Partial | [34] |

Hence, we propose an algorithm (Figure 1) to calculate a single count of publications, citations, and h-index for authors, serving various stakeholders.
The objectives of our study are:
1. To propose a common platform that can provide a single article count, citation count, and h-index in the education field.
2. To check the statistical validity of the proposed platform in terms of the number of articles, the number of citations, and the h-index at the author, organization, and journal level.
The Weighted Unified Informetrics (WUI) Algorithm: In the proposed algorithm, we use the bibliographic databases Scopus and WoS due to their indexing age, availability of data, and authenticity. A weighted unified informetrics system named "conflate" is proposed and discussed (Figure 1).

METHODS
Generation of DOI-based citation database: For data extraction from both Scopus and WoS, we require inputs at three levels: an ORCID iD for author information, an organization name for university/institute access, and an ISSN for journal information. Based on the author's ORCID iD, we retrieved the number of publications from both Scopus and WoS.
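The merge step underlying the conflate database can be sketched as follows. This is a minimal illustration only: the record layout and field names are assumptions, not the exact implementation of Algorithm 1.

```python
def build_conflate(scopus_records, wos_records):
    """Merge two publication lists into one DOI-keyed conflate database.

    Each record is a dict with at least 'doi' and 'citations'.
    Records without a DOI are dropped (the DOI filtration step).
    """
    conflate = {}
    for source, records in (("scopus", scopus_records), ("wos", wos_records)):
        for rec in records:
            doi = rec.get("doi")
            if not doi:  # DOI filtration: skip records lacking a DOI
                continue
            entry = conflate.setdefault(doi, {})
            entry[source] = rec["citations"]
    return conflate

# Hypothetical sample data for illustration
scopus = [{"doi": "10.1/a", "citations": 10},
          {"doi": "10.1/b", "citations": 4},
          {"doi": None, "citations": 7}]      # no DOI -> filtered out
wos = [{"doi": "10.1/a", "citations": 8},
       {"doi": "10.1/c", "citations": 2}]
db = build_conflate(scopus, wos)
# "10.1/a" is common to both sources; "10.1/b" and "10.1/c" are unique
```

Keying on the DOI makes each publication appear once in the conflate database, with per-source citation counts kept side by side for the weighting step.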

Motivation
Bibliometrics enable the categorization and practical analysis of research contributions worldwide through bibliometric data sources such as Scopus and Web of Science. The generated datasets are accessed for many purposes, including analytic functions, funding decisions, and scientific extensions. Recognized scientific characteristics and investigations reveal that bibliometric data sources such as Scopus and Web of Science present different performance characteristics (see Table 1), such as different numbers of papers, citations, and h-index for the same scientific inputs.
Hence, this study provides a one-stop solution for various stakeholders by providing a single paper, citation, and h-index count for scientific individuals or groups across multiple bibliometric data sources. Here, we use DOI filtration to check the availability of a publication in multiple bibliographic databases, and aggregation to provide an actual and authentic count of informetrics. Table 1 presents the key areas and limitations listed in studies on the comparative analysis of bibliographic databases. The results reveal that a few studies [22,27,34,45] have attempted a partial solution, but none of the literature provides a comprehensive or complete solution to overcome the limitations of bibliographic databases.

Research Gap
Due to such limitations, universities, accreditation agencies, ranking agencies, and hiring agencies ask authors to report publication, citation, and h-index counts from each bibliographic database separately, both in job applications and in assessments. There is no common platform that can record or calculate a single set of informetrics across multiple bibliographic databases. This situation has raised the requirement for bibliometrics whereby an author can provide a single count of publications, citations, and h-index.

Computation of weighted unified informetrics
The conflate database generated by Algorithm 1 is used for the further computation of weighted unified informetrics. First, for a given author, common and unique citations are filtered; then a weight is assigned for the final calculation of unified informetrics. Algorithm 2 describes this process in sequence, which is repeated for organizations and journals as well, and Figure 2 summarizes the computation of weighted unified informetrics for the different entities.
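The filter-then-weight step can be sketched as below. The exact weighting scheme belongs to Algorithm 2 and is not reproduced here; purely for illustration we assume a weight of 0.5 per source for a DOI indexed in both databases and full weight for a DOI unique to one source.

```python
def unified_citations(entry, common_weight=0.5):
    """entry maps source name -> citation count for one DOI.

    Common publications (present in both sources) get a weighted
    combination; unique publications keep their single-source count.
    Weighting scheme here is an illustrative assumption.
    """
    if len(entry) > 1:  # common publication
        return round(sum(common_weight * c for c in entry.values()))
    return next(iter(entry.values()))  # unique publication

def h_index(citation_counts):
    """Largest h such that h publications have at least h citations each."""
    h = 0
    for i, c in enumerate(sorted(citation_counts, reverse=True), start=1):
        if c >= i:
            h = i
    return h

# Hypothetical conflate entries for one author
conflate = {"10.1/a": {"scopus": 10, "wos": 8},  # common: 0.5*10 + 0.5*8 = 9
            "10.1/b": {"scopus": 4},             # unique to Scopus
            "10.1/c": {"wos": 2}}                # unique to WoS
per_paper = [unified_citations(e) for e in conflate.values()]
total = sum(per_paper)    # unified citation count
h = h_index(per_paper)    # h-index over unified counts
```

The same functions apply unchanged at the organization and journal level, since those entities are simply larger sets of DOI-keyed entries.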

Data extraction
To perform this study, data from Scopus and WoS was extracted with Python-based APIs. For data extraction from Scopus, we used the Python API wrapper "pybliometrics", an easy-to-use library to pull, cache, and extract data from the Scopus database. [36] Scopus database access is based on API keys available at https://dev.elsevier.com/apikey/manage. After creating a user account on the Elsevier Developer Portal, anyone can obtain an API key for programmatic access to citation data and abstracts, journals, research metrics, and related metadata indexed by the Scopus citation database. For data extraction from WoS, we used the Python client "wos python client", a SOAP (Simple Object Access Protocol) based client for querying the WoS database and retrieving results as XML. [37] Web Services Premium access, a paid service of WoS, is required to extract data from the WoS citation database (https://developer.clarivate.com/).

Data Description
Data selection is performed at three levels: author, organization, and journal (Figure 3). To perform this study, data from Scopus and WoS was obtained with Python-based APIs. [36,37] Scopus and WoS were selected partly arbitrarily and partly for the availability of their data.

RESULTS
Here we present the comparative analysis of Scopus, WoS, and conflate at the author, organization, and journal level.
Author level bibliometrics
(Author-level comparison figures are not recoverable from the extracted text.)

Organization level bibliometrics
Organization-level comparisons are shown in Figure 6. Further, one can analyze the different disciplines of these organizations to track the most popular discipline in terms of research publications. It is observed that the average number of articles per organization in WoS is 7971, whereas in conflate it is 8737. Conflate also reports a significantly higher number of citations, averaging 134831 compared to 93371 in WoS. The average h-index in conflate is 100, which is considerably higher than the average of 82 reported by WoS. Table 3 presents the comparative analysis of Scopus, WoS, and conflate for articles, citations, and h-index across 100 organizations grouped into 4 main organization categories.

Table 3. Articles, citations, and h-index by organization category (Scopus / WoS / Conflate).

| Category | Articles (Scopus / WoS / Conflate) | Citations (Scopus / WoS / Conflate) | h-index (Scopus / WoS / Conflate) |
|---|---|---|---|
| NITs | 50362 / 39059 / 47683 | 416302 / 318676 / 492760 | 74 / 66 / 80 |
| Universities | 565887 / 451489 / 499159 | 6051897 / 4917831 / 6997951 | 85 / 76 / 93 |
| IISc & IISER | 77349 / 70063 / 73835 | 1252987 / 1107018 / 1542242 | 98 / 92 / 111 |
| IITs | 270495 / 236547 / 253042 | 3678723 / 2993534 / 4450159 | 122 / 110 / 134 |

Journal level bibliometrics
Here we analyzed 1000 journals, broadly divided into 5 disciplines (journal count): Engineering (800), Social Sciences (119), Life Sciences (35), Sciences (27), and Humanities (19). The results are summarized in Figure 8. The major limitation of the study is that we consider only publications for which a DOI exists. Where WoS and Scopus do not have a DOI for a particular publication, we cannot treat the publication as authentic, and the author loses that publication and its citations from the count. Moreover, this can mean a citation loss for low-profile authors whose work is indexed only in Scopus or only in WoS. Journals that are indexed only in Scopus or only in WoS likewise remain invisible to the other bibliographic databases. If an author publishes in a journal that is indexed in multiple bibliographic databases, there is a good chance of higher worldwide visibility, readership, and citation of the work. As new bibliographic databases may appear in the near future, the proposed system should support the integration of those databases into the existing system.
To conclude, there are still several possible areas for further exploration and extension. Some interesting areas for future development and research are: 1. Different bibliographic databases: We have studied the features of two bibliographic databases, Scopus and WoS, so the performed study is limited to these two. One can extend the study further with bibliographic databases such as Google Scholar, Dimensions, Crossref, OpenAIRE, DataCite, Mendeley, Zenodo, etc. [38][39][40] All these bibliographic databases may be conflated as per the model to calculate unified informetrics.
2. Different technological aspects: One can extend the study further with the use of distributed ledger technology and its core elements in the research publishing industry. Distributed ledger technology has found applications in education for the verification of academic records, [41] the adoption of smart learning environments, [42] and the implementation of mobile-based higher education systems. [43] Features like decentralization, persistency, anonymity, and auditability of records give stakeholders more confidence in a system presenting the scientific work of authors, organizations, and journals. [44] Hence, using distributed ledger technology in the research publication industry can be considered a viable choice for systematically achieving a sustainable system in the interest of its stakeholders.

DISCUSSION AND CONCLUSION
The key findings of the work can be summarized as follows: (i) It presents a unified method to maintain records associated with the entities author, organization, and journal; this method determines an absolute number of articles and citations for each entity. (ii) It maps multiple bibliographic databases for the calculation of the h-index and related informetrics. (iii) The proposed system facilitates the establishment of a clear and authentic environment for stakeholders to measure the research of entities. (iv) It presents in-depth analyses of core components like publications, citations, and h-index.
The presented work has several advantages: (i) The DOI-based data filtration helps to establish the authenticity of received citations and publications. (ii) Different stakeholders, such as government agencies, accreditation agencies, ranking organizations, and funding agencies, can use the proposed system to evaluate the research contribution of individuals, organizations, and journals. (iii) The proposed system is a novel system built on the conflation of two traditional bibliographic databases, Scopus and WoS.
The proposed informetrics provides a transparent and distributed view of research contributors to its stakeholders. The calculated results also signify the efficiency of "conflate". Scopus and WoS were used for the implementation due to the availability of data. At the author level, the performance of the proposed informetrics is largely equivalent to Scopus for the number of publications, citations, and h-index. A significant difference is visible at the organizational level: the proposed informetrics shows a gain in the number of citations and h-index in all organization categories, while showing average performance for the number of publications. At the journal level, WoS has a higher publication count in Humanities but lower citation and h-index counts, whereas the proposed informetrics gives an average performance for publications and the best for citations and h-index. For the other disciplines, Scopus and the proposed model have almost identical results. In general, the proposed informetrics will always reflect the best counts available across the constituent databases.