‘Big data’ Research: A Bibliometric Analysis of the Scopus Database, 2009–2019

Scopus-database publications containing the keyword ‘big data’ have skyrocketed from 30 (2009) to almost 16,000 (2019). This trend reveals this field’s importance across disciplines and contexts. Previous works have analysed the emergence and characteristics of scientific research on ‘big data’ but need updating. We undertook a bibliometric analysis of over 73,000 such 2009–2019 publications. These data helped to identify the primary trends, subjects, networks and institutions publishing on big data worldwide and to explain the relations and differences between scientific communities working on this subject in central and peripheral countries. Furthermore, this research highlights Chinese researchers’ and institutions’ prominence in this field alongside the influence of American contributions, which are most frequently cited. The emergence of dynamic poles of scientific production in middle-income countries in Asia, Africa and South America is also studied. Despite the dynamism of the field, about 2% of the articles account for 40% of the field’s citations, while 42% have no citations. Originating in computer science and engineering, big data research is increasingly becoming interdisciplinary. Keyword trends over time also show a shift from technical and prospective concerns towards (1) methodological and practical issues and (2) the development of AI and machine learning techniques. These indicators differ between countries with varying geo-economic conditions. Collaboration networks have rapidly grown with the US and China as the main nodes and European countries as intermediaries in the circulation of this topic. Although still rare, there are some signs of South-South collaboration between Latin America, Africa and Asia.


INTRODUCTION
Production, management and processing of data have been constant concerns for modern States since their early days. Indeed, epidemics, population control, tax collection and warfare have stimulated innovations in the collection and processing of reliable data for several centuries in order to produce information useful for government decision-making. Technological innovations in the post-war period allowed the digitalisation of data, and with it, States started systematically using digital platforms to organise and process the information collected on their citizens, territories, institutions and economies. As part of this process, concerns about processing large amounts of information started emerging during the 1970s and 1980s. [1] However, the 'data problem' as we know it today started materialising with the introduction of the World Wide Web in the early 1990s. The exponential growth of data production and storage on digital media platforms encouraged several technological innovations to address both support (hardware) and processing (software) issues. Several publications from the 1990s and early 2000s account for this process. [2][3][4] Some of these publications have already explored the effects and potentialities of big data for economic and financial analysis. [5,6]
Google regarding its own systems: Google File System [17] and MapReduce. [18] These seminal articles set the direction for this type of data processing. Owing to the great transformations produced by the application of this new technology, the notion of 'big data' began to spread from the computer engineering and data fields to other areas with immense potential such as health, business, finance and social sciences.
Previous studies [7,12,14,19,20] have pointed out that the number of publications on the subject started to increase from around 2008. However, the trends observed in the first publications are now very limited compared to the more recent volume of publications. In eight years alone (from 2011 to 2019) the annual publications on this topic (in all areas and disciplines) increased from 90 to 16,000. This paper aims to update the observations and findings of previous bibliometric works by extending their analysis to the entire scientific corpus of the last decade. Moreover, most of these publications have focused on analysing the general bibliometric performance of the field. However, little attention has been paid to the geographical and economic differences as well as the linkages and circulation of knowledge between countries from different economic regions. A second objective of this paper is to shed light on these differences and linkages by analysing not only the performance differences between countries but also the international collaboration networks behind this explosion of publications during this decade.

LITERATURE REVIEW
There are a handful of specific bibliometric analyses of 'big data'. The first one, published as a blog post on Research Trends in 2012, highlighted the growing trend of publications since 2008 and the emergence of 'big data' as a research area. [12] It was followed by works by Singh et al. [20] from India, Tseng et al. [21] from Taiwan, Peng et al. [22] from China and Brazil, López-Robles et al. [23] from Spain and Mexico, Gupta and Rani [7] from India, Parlina et al. [14] from Indonesia and Raban and Gordon [19] from Israel. Table 1 summarises the main features and conclusions of these earlier works.
Table 1 (fragment): Relations between "data mining" and "big data" scientific literature.
Continuing with the same line of analysis as these publications, this article aims to contribute to, update and extend their findings by adopting a double strategy. On one hand, we use a bigger and up-to-date dataset which allows us to capture a broader picture of the evolution of big data research until the end of 2019. On the other hand, we focus on some characteristics of this scientific production that have been neglected in previous works. Through such an analysis, we expect to produce a more general picture of the international structure of scientific production on this topic in order to identify the weaknesses and opportunities for a more balanced global development of the field.

DATA AND METHODOLOGY
Table 1 (fragment): Relation between "data science" and "big data". Different paths of scientific production ("data science": gradual vs. "big data": exponential) / New trend of publications combining concepts from both corpora / The two fields have different academic origins and leading publications / Big data literature is more prominent, has intensive citation activity and bigger funding, particularly from China / Data science literature serves as a theory base or a toolbox for big data publications.
Most previous works have based their analysis on datasets from the Web of Science (WoS) delimited by research area or type of document. Gupta and Rani [7] and López-Robles et al. [23] used the biggest datasets available for their analyses, with about 25,000 entries each. Aiming to obtain different insights, this paper uses a larger dataset from a less explored source. It was built through a general query for documents containing the keyword 'big data' in the fields 'title, abstract or keywords' of the Scopus database. This query returned 75,300 documents published between 1970 and 2019. The results were then exported as several 'csv' files containing 2,000 entries each, with all available information on citation and bibliographic details, abstracts and keywords, funding and other details. These files were compiled and subsequently validated through a verification algorithm developed in Python 3.7 to exclude duplicates and entries lacking the minimal information (author and affiliation) needed for further analysis. The resulting dataset contained 73,230 entries. The raw data obtained from Scopus contained authorship and affiliation for each article in single cells. Therefore, additional splitting operations were required in order to obtain information on an individual basis. The split dataset of 261,826 entries gives us individualised information on every author of these 73,230 publications. Even though several of these entries correspond to the same individuals, they provide specific information about the conditions under which these authors contributed to those publications (affiliations, funding, etc.).
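The validation and splitting steps described above can be illustrated with a minimal sketch. It is not the verification algorithm used for this study, which is not reproduced in the paper. The column names ('EID', 'Authors', 'Authors with affiliations') and the layout of the affiliation cell ('Surname, Initials, Institution, City, Country', with authors separated by semicolons) are assumptions made for illustration only.

```python
def clean_records(records):
    """Drop duplicate entries and entries lacking author or affiliation data.

    `records` is a list of dicts, one per exported row. The keys used here
    are hypothetical column names, not confirmed by the paper.
    """
    seen = set()
    cleaned = []
    for rec in records:
        key = rec.get("EID", "").strip()
        authors = rec.get("Authors", "").strip()
        affiliations = rec.get("Authors with affiliations", "").strip()
        # Skip entries without an identifier, repeats, or missing minimal info.
        if not key or key in seen or not authors or not affiliations:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def split_authors(records):
    """Return one row per (publication, author) pair."""
    rows = []
    for rec in records:
        for entry in rec["Authors with affiliations"].split(";"):
            parts = [p.strip() for p in entry.split(",") if p.strip()]
            if not parts:
                continue
            # Assumed layout: surname, initials, ..., country.
            has_affiliation = len(parts) >= 3
            rows.append({
                "EID": rec.get("EID", ""),
                "author": ", ".join(parts[:2]) if has_affiliation else parts[0],
                "country": parts[-1] if has_affiliation else "",
            })
    return rows


sample = [
    {"EID": "a1", "Authors": "Li X., Smith J.",
     "Authors with affiliations":
         "Li, X., Tsinghua University, Beijing, China; "
         "Smith, J., MIT, Cambridge, United States"},
    {"EID": "a1", "Authors": "Li X., Smith J.",
     "Authors with affiliations":
         "Li, X., Tsinghua University, Beijing, China; "
         "Smith, J., MIT, Cambridge, United States"},
    {"EID": "a3", "Authors": "", "Authors with affiliations": ""},
]
author_rows = split_authors(clean_records(sample))
```

In this toy input, the repeated entry and the entry without authors are discarded, and the remaining publication yields one row per author.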
This dataset was then analysed for three main aspects: scientific productivity and performance indicators; thematic maps, clusters and trends; and collaboration networks. Scientific productivity and performance indicators include volume and growth rates of publications, major contributors, journals and articles disaggregated according to year, document type, institutions and countries. The thematic analysis considered the distribution of publications by research area, top research areas by year and country, keyword clustering and trend analysis over time. Finally, a network analysis was performed at three levels (authors, institutions and countries) to identify main, local and international collaboration clusters and trends. Special attention was given in each step to the differences and connections between countries, institutions and researchers from different geographical and economic conditions. We have used the location within continents and subcontinental regions as well as the World Bank Country and Lending Groups classification for 2020 as indicators of these differences: low-income economies (Gross National Income per capita (GNIpc) <= US$1,035); lower-middle-income economies (GNIpc between US$1,036 and US$4,045); upper-middle-income economies (GNIpc between US$4,046 and US$12,535) and high-income economies (GNIpc >= US$12,536). [24]
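As an illustration, the income classification can be expressed as a small lookup function. This is a sketch rather than the code used in the study; the function name is hypothetical, and a GNI per capita of US$12,536 or more is treated as high income, in line with the World Bank thresholds for that year.

```python
def income_group(gni_per_capita_usd):
    """Map GNI per capita (US$) to the World Bank 2020 income category."""
    if gni_per_capita_usd < 0:
        raise ValueError("GNI per capita cannot be negative")
    if gni_per_capita_usd <= 1035:
        return "low"
    if gni_per_capita_usd <= 4045:
        return "lower-middle"
    if gni_per_capita_usd <= 12535:
        return "upper-middle"
    return "high"
```

Each country-level indicator can then be aggregated by the label returned for the affiliation country of a publication.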

Scientific Productivity and Performance Indicators
a) Scientific output
The very first recorded entry on the Scopus database containing the keyword 'big data' was published in 1970. Up to December 2019, 73,230 items containing this keyword or pre-coordinated concept [19] in the title, the abstract or the keywords have been published. This literature is composed mainly of conference papers (58%) and journal articles (32%). Other contributions are less representative: reviews (4%), book chapters (3%), books (1%) and others (3%: editorials, notes, articles in press, letters, short surveys, etc.). This output has not grown steadily; rather, it has followed an exponential trend that started a decade ago (Figure 1). Indeed, the scientific production on this subject advanced from around 30 articles published annually at the end of the 2000s to almost 16,000 by 2019. This corresponds to an average growth rate of about 90% per year.
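The order of magnitude of this growth rate can be checked with a compound annual growth calculation. The sketch below assumes that the endpoints are roughly 30 publications in 2009 and roughly 16,000 in 2019, as stated in the abstract; the paper does not specify how its average was computed, so this is only one plausible reading. Under these assumptions the rate comes out at roughly 87% per year, consistent with the reported figure of about 90%.

```python
def average_annual_growth(first_count, last_count, years):
    """Compound annual growth rate between two yearly publication counts."""
    if first_count <= 0 or years <= 0:
        raise ValueError("first_count and years must be positive")
    return (last_count / first_count) ** (1 / years) - 1


# Assumed endpoints: about 30 publications in 2009, about 16,000 in 2019.
rate = average_annual_growth(30, 16_000, 2019 - 2009)
print(f"{rate:.1%}")
```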
This scientific production is not evenly distributed around the world. China (20,838) and the United States (16,696) have indisputable leadership in this field. However, it is worth noting that even though US-affiliated researchers led the field in the early years in terms of the number of publications, they were quickly overtaken by China-affiliated researchers. Since 2015, North American publications have stagnated while China has doubled its number of publications (Figure 2). This outperformance is most likely related to China's techno-nationalistic R&D policy that aims at the country's digital transformation and global leadership in the data and artificial intelligence industries. [25] India, with over 1,500 publications in 2019, and Great Britain, with almost 900, also reflect a growing interest in the subject. During the last five years, the former has presented a higher average annual growth rate (31%) than the latter (15%). Other important contributors from Asia, Europe, Australia and Canada have annually produced 300-600 articles; this represents an average annual growth rate of about 14% during the last five years. Among newcomers, Russia is one of the fastest-growing contributors, with an average annual growth rate of about 36% since 2015. Most of these publications were produced in high- (55%) and upper-middle-income countries (34%). Contributions from lower-middle- (11%) and low-income countries (0.1%) are less representative (Figure 3).

b) Authors and Institutions
Only 12% of these publications are individual contributions. Most (75%) were produced by teams of 2-5 researchers, 12% by teams of 6-10 and 1% by teams of more than 10 researchers. It is worth noting that 16 of the latter were the output of collaboration networks of 50 to 100 researchers and 8 of networks of more than 100 researchers from 28 countries. However, most of the collaboration networks behind these publications are nation-based. Only 19% are international: 15% are binational, 3% include researchers from 3 countries, 1% from 4 countries and less than 1% from 5 or more countries. Moreover, the core of this scientific body, composed of authors with more than 10 publications, has only about 1,000 researchers. Among these, the top 10 authors have more than 50 publications (Table 2). They are based mainly in China, the US, the UK and India but also in Italy, Canada, Australia, Spain, Saudi Arabia, Korea and Portugal.
By country income level, the main authors among upper-middle-income countries are mainly Chinese, South African and Colombian, with 35 to 50 publications. The leading researchers from lower-middle-income countries are predominantly affiliated with institutions in India and Morocco. These researchers have produced between 18 and 35 articles. Finally, researchers from low-income countries are mainly from Africa, the Middle East and Nepal. However, their productivity remains very low: 2 articles per author, except for Sun (12) and Maharjan (4) (Table 3).

Main Journals and Conferences
In order to complete this general overview of the scientific productivity and performance indicators of global research on 'big data', it is valuable to identify the main conferences and scientific journals which present and publish the results of these studies (Figure 4). With respect to the former, the proceedings published in the series Lecture  All other journals used by researchers from lower-middle- and low-income countries to publish their results are different (Table 5). These numbers suggest that despite the availability of some journals which articulate the research and debates

Citations
Last but not least, besides the volume of scientific production, the number of citations of an article denotes its relevance or influence in the field. The dataset used for this study shows that the largest share of this scientific production (42%) has no citations at all. This means that their results and insights have not yet found an echo in their respective scientific communities. Another 38% have between one and five citations, 8% between 5 and 10 and 9% up to 50 citations. Only 1.8% have more than 50 citations, yet these account for 41% of total citations in the field. Among the latter, about 30 articles have been cited more than 500 times and about 10 articles more than 1,000 times. Undoubtedly, these are the seminal articles in the field (Table 7). The subjects, domains and sources of publication of this literature are very heterogeneous. The domains include bioinformatics, urbanism, computer sciences, artificial intelligence (AI), business, health and psychology, among others.
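The concentration figures reported above (share of uncited publications, share of highly cited publications and the share of all citations they receive) can be derived from a plain list of citation counts. The following sketch uses a hypothetical function name and toy numbers rather than the study's data; the threshold of 50 citations mirrors the one used in the text.

```python
def citation_concentration(citations, threshold=50):
    """Summarise how citations are distributed across publications.

    `citations` holds one citation count per publication.
    """
    n = len(citations)
    if n == 0:
        raise ValueError("at least one publication is required")
    total = sum(citations)
    top = [c for c in citations if c > threshold]
    return {
        "uncited_share": sum(1 for c in citations if c == 0) / n,
        "top_paper_share": len(top) / n,
        "top_citation_share": sum(top) / total if total else 0.0,
    }


# Toy example: ten publications, two of them above the threshold.
toy_counts = [0, 0, 0, 0, 1, 2, 3, 4, 60, 130]
summary = citation_concentration(toy_counts)
```

In the toy example, 40% of the publications are uncited, while the two publications above the threshold (20% of the total) hold 190 of the 200 citations.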
It is worth noting that despite the leadership of Chinese production in this field, the influence of researchers affiliated to American institutions on these seminal publications is more dominant. German, British and Spanish researchers have also contributed to this core literature. Moreover, some of these publications also disclose the existence of collaboration networks between the Chinese, British and North American authors of these articles. If we look at the top 20 authors with more than one publication, it is evident that half of them are researchers affiliated to American institutions. The other half consists of Malaysia-, China-, UK-, South Africa- and Georgia-affiliated researchers (Table 8). Guizani appears to be the most influential author in the field with 25 publications and more than 2,600 citations. However, 85% of these citations refer to only one of his articles: 'Internet of Things: A Survey on Enabling Technologies, Protocols, and Applications'. Other authors, such as those from the Spanish team behind the top article (Table 7), have been excluded from this list since they have only made one contribution to this domain of research.
Most of this research is interdisciplinary. This suggests that many citations of these articles are not related to 'big data' and originate in other scientific communities. The dataset used for this study does not contain sufficient information to measure only the citations within the same corpus. This issue could be addressed by further research.
If we split these data according to the income categories of the countries, we find that the top ten influential authors in high-income countries are all North American-affiliated researchers. On average, each author has approximately 10 publications and between 1,700 and 2,600 citations. In upper-middle-income countries, researchers with Malaysian and Chinese affiliations are the most influential; on average, they have more publications than their North American peers (19) and between 1,300 and 2,300 citations each. The top 10 researchers from lower-middle-income countries are mostly from India. On average, they have published 14 papers and have been cited between 190 and 1,000 times each. Finally, among the researchers from low-income countries, Sun, affiliated with the PNG University of Technology in Papua New Guinea, is the most prolific author with about 14 papers and over 90 citations. Other researchers from these countries include (Table 9).
Journal of Scientometric Research, Vol 11, Issue 1, Jan-Apr 2022

Thematic Analysis
Distribution of Publications by Research Area
Most of these publications are contributions to computer science and engineering (81%). Contributions to medicine and social sciences are only 4% each and mathematics, biology, business, physics, earth, decision and environmental science are 1% each. The remaining 4% is shared by the other 17 fields of the Scopus classification (Figure 5). However, these numbers can be deceptive since several of these publications are situated at the intersection of different fields. In fact, 48% of these publications are categorised in at least two different research areas, 11% in three and 4% in four or more areas, which highlights the interdisciplinary character of this research field. If we disregard the computer science and engineering fields in the publications categorised under at least two areas, the resulting distribution is more heterogeneous.
Mathematics represents about 18% of these contributions, decision science 11%, social science 7% and medicine 5%. The rest of the fields, which collectively represent 18%, double their participation (Table 8). Finally, if each field is considered individually in relation to the total number of publications, decision science rises to 14%, social sciences to 11%, medicine to 7% and business to 6%. This reveals that approximately 15% of these publications are at the intersection between the social sciences, decision sciences and engineering and computer science fields.

Research Areas by Year
Big data research started as a computer science subject but has rapidly spread to other research areas. The publications before 2013 were mainly contributions to the computer science and engineering field (58%). The share of this field was, however, reduced to 35% between 2017 and 2019. This reduction reflects the diversification of approaches and the development of interdisciplinary projects with scientists from other areas, particularly mathematics, social and decision sciences, medicine and business (Figure 6).

Research Areas by Country
By country income group, we observe slight differences between high- and upper-middle-income countries, with relatively more contributions from the computer sciences, engineering, social sciences, medicine and business fields in the high-income countries and more from mathematics, decision sciences and materials in the upper-middle-income countries. Lower-middle- and low-income countries show greater differences for this indicator. The contributions from the former are relatively more concentrated in computer sciences and less in social science and medicine than all the other categories. The latter have relatively fewer contributions in computer sciences and engineering (33%) and relatively more in the decision and environmental sciences (Figure 7).
If we look at the leading countries by research area, we can see that they more or less follow the same pattern as the global analysis: China and the US lead almost every field and particularly the most important ones. However, there are some areas where the US maintains leadership over Chinese research. This is evident in the case of social sciences, agriculture, arts and humanities, biochemistry, economics, business, health, immunology, medicine, psychology, pharmacology and neurosciences. It is equally interesting to note that apart from the main players, not all countries have produced research in all areas, and its development is uneven. This suggests a certain degree of specialisation in some countries (Figure 8).

Keywords Trends over Time
Beyond the classification by research areas, the keywords used to index these publications offer a more specific picture of the main topics studied within big data literature. They also reveal the predominant trends and evolution of this research over time. The word clouds in Figure 9 show three different phases within this literature. At first, the publications were more diverse, but they were also more general and mainly oriented towards technical challenges, impacts and potential applications. By 2015, the research focused on technical and methodological issues. By the end of the decade, the focus of the research seems to have shifted from the application of the accumulated knowledge and techniques to the development of systems and techniques.
A review of the subjects of the publications within the field of social sciences shows that despite the importance which big data has acquired in recent years, there is a deficit of academic production on the role and effects that it has on the State's decision-making processes, the design of public policy instruments and its relations with other sectors of society. [26][27][28] Figure 10 shows word clouds with the main keywords of the articles published each year. Word sizes reflect the relative frequency of each keyword within each year. As we can see, initially, the few articles in social sciences using the concept were related to technical problems. However, by 2015, a great deal of these publications seem to have focused on urban problems such as traffic or smart cities along with technical and methodological issues. By 2019, digital media, artificial intelligence, smart cities and systems seem to have become the main concerns of social scientists using this concept (Figure 10).
The picture of the main topics also changes if we look at the differences between country income groups. Publications from high-income countries are more focused on the application of this knowledge to machine learning, analytics, computing, management and, to a lesser extent, to social and smart systems problems. Upper-middle-income countries follow a similar trend, but they seem to be more advanced in the application of this knowledge to the development of artificial intelligence. On the other hand, lower-middle- and low-income countries present quite a different picture. The former are focused on the research of technical issues related to big data such as mining, the Internet of things, smart systems, etc. Even if they are less prolific, the latter are more diverse and centred on leveraging this knowledge in learning, predicting, detecting and, to some extent, in social problems (Figure 11).

Collaboration networks
If we analyse international collaborations by author, we get a relatively scattered landscape where about 168,000 individuals mostly collaborate in small independent networks: 62% have only one link, 13% have more than five links and only 5% have more than 10 links. Figure 13 shows the core network of the scientific community working on big data, which connects about 1,000 researchers with more than 10 publications with approximately 9,000 collaborators in their countries and abroad: 60% collaborate within the same country and only 18% collaborate with authors from countries with different income categories. Most of these collaborations are between Chinese, American and European researchers. There are collaborations between South American researchers and researchers in North America as well.
while North Americans represented only 15%. This was also the case with Africa, where approximately 40% of 1,300 external collaborations were with European countries. Asian networks seem to be more self-centred with more than 36% of the 16,000 links within the continent, but they also have strong connections with North America (27%) and Europe (25%).
Finally, if we take a look at the collaboration networks by affiliation institution, we get a more structured global network of about 44,000 actors organised around two hubs: one formed by Chinese institutions (17%) and the other formed by North American (18%) and European institutions (30%), especially universities (42%). These two hubs directly interact through a great variety of smaller actors, including institutions in different continents, regions and geo-economic categories.
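The link-count shares reported for the author network (actors with one link, more than five links and more than ten links) could be computed along the following lines. This is a sketch under stated assumptions, not the procedure used in the study: a 'link' is taken here to mean a distinct co-author, publications are represented as plain lists of author names, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def collaborator_counts(publications):
    """Count distinct co-authors per author from lists of author names."""
    neighbours = defaultdict(set)
    for authors in publications:
        # Every pair of co-authors on a publication forms one link.
        for a, b in combinations(sorted(set(authors)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {author: len(linked) for author, linked in neighbours.items()}


def link_shares(counts):
    """Share of authors with exactly one, more than 5 and more than 10 links."""
    n = len(counts)
    if n == 0:
        raise ValueError("no collaborating authors found")
    degrees = list(counts.values())
    return {
        "one_link": sum(1 for d in degrees if d == 1) / n,
        "more_than_5": sum(1 for d in degrees if d > 5) / n,
        "more_than_10": sum(1 for d in degrees if d > 10) / n,
    }


toy_publications = [["A", "B"], ["A", "C"], ["D"], ["A", "B"]]
shares = link_shares(collaborator_counts(toy_publications))
```

Single-author publications contribute no links, so author 'D' does not appear in the toy network. The same counting applies to institutions or countries if the lists hold affiliations instead of author names.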
The average institution has approximately 100 links with other institutions and the biggest ones (<1%) have more than 1,000 links. Figure 14 shows the network of affiliation institutions collaborating on research regarding big data with institutions in the same country (47%) and abroad (53%). Among the latter, only 17% work with institutions in different income country categories. European institutions have links with African, Latin American and Asian institutions. The data also show some South-South collaborations among countries in Latin America, Africa and Asia.

CONCLUSION
We have presented an analytical mapping of research on big data from a set of more than 73,000 entries of the Scopus database which were published in the last two decades. We evaluated this corpus for three main aspects: (1) scientific productivity and performance indicators; (2) main research areas and thematic trends and (3) collaboration networks between authors, research institutions and affiliation countries. We directed special attention towards the main relations and differences between scientific communities working on this subject in countries situated in different regions and economic conditions. Our main findings show that scientific productivity has exponentially increased since 2010 with an average growth rate of approximately 90% per year. China and the United States lead the scientific production on big data. By country income level, Chinese, South African and Colombian researchers lead among upper-middle-income countries, while researchers from India and Morocco lead among lower-middle-income countries. Of the top 20 institutions hosting big data researchers, 18 are Chinese, followed by the University of Delhi and MIT. Lecture Notes in Computer Science is the most important venue on big data, and IEEE Access is the principal academic journal on this topic. Regarding citations, despite the higher productivity of Chinese researchers and institutions in this field, American contributions remain the most influential, and the most cited articles and authors come from the US.
Despite the dynamism of the field, about 2% of the articles account for 40% of the citations of the field, while 42% of these publications have no citations whatsoever. The main scientific field publishing on big data is computer science and engineering (81% of the publications in the entire period), followed by medicine and social sciences. However, since 2017, the relative importance of this area has fallen to 35% due to the development of interdisciplinary projects in mathematics, social and decision sciences, medicine and business. The keyword trends over time show that, by 2010, the literature was mainly oriented towards technical challenges, impacts and possible applications of these technologies; by 2015, it focused on technical and methodological issues; and in the past few years, it has shifted towards the development of AI and machine learning techniques. Lastly, we observe important scientific collaboration activity: only 12% of the publications are individual contributions. However, most collaboration networks are nation-based; only 19% belong to international networks. This has changed over the last five years from national-centred clusters to a more international network, where the US and China are the main nodes. European countries appear to be the main intermediaries in the circulation and development of the knowledge in this field with countries from Africa and South America. Although to a lesser extent, we have also detected some South-South collaborations between Latin America, Africa and Asia. Thus, we have presented a detailed characterisation and a comprehensive analysis of global big data research over the last decade. This research updates previous bibliometric works by extending the previously analysed corpus and exploring a less explored database.
Our most important contribution to bibliometric analysis is the insights we provide on the differences in scientific productivity, research areas and trending topics, as well as the collaboration networks between countries from different geo-economic conditions. These differences highlight the uneven distribution and circulation of big data knowledge behind the growth in publications over the last decade. Further research could provide a deeper and more detailed characterisation of research in this field in specific regions and countries, as well as in specific research areas and topics.