Some Salient Aspects of Machine Learning Research: A Bibliometric Analysis

Machine learning has emerged as an important and distinct area of research that is closely related to, and often overlaps with, various domains within computer science, computational statistics, artificial intelligence and cognitive science. One can observe connections with these fields at the cognitive level (in terms of theoretical framework) and at the methodological level (drawing on the tools and techniques of these fields). The evolution of the field has taken a very directed and operational approach, with the basic tenet of machine learning being ‘teaching computers how to learn from data to make decisions or predictions’. As we move towards systems that increasingly need to exploit data, research in this area is becoming more application oriented and expansive in scope, with loci of research and innovation dispersed across academia, research institutions and industry. It is thus becoming a challenging as well as useful exercise to understand the structure and dynamics of this field. The paper is centred on this issue; it tries to capture the intellectual structure of this field and its research trends through quantitative and statistical analysis of research publications. Conceptual connections are constructed from linkages among keywords using the tools and techniques of Social Network Analysis, which also acts as the conceptual framework for the study. Some indications from patent statistics are also drawn to provide insights into technological trends.


INTRODUCTION
'Can machines learn?' has long been a subject of scepticism and philosophical discourse, as well as a serious area of research for those involved in cognitive studies, artificial intelligence and related areas. The idea of developing a machine that can learn from experience can be formally attributed to Alan Turing, who in a 1947 talk to the London Mathematical Society spoke of the need for a machine that can learn from experience and gave the blueprint of an intelligent machine. [1] However, the formal coinage of the term 'machine learning' is attributed to Arthur Lee Samuel in 1959; his research can be seen as giving shape to the birth of this new field. Machine learning research is seen as a subset of artificial intelligence, yet it has evolved into a well-delineated field in itself that focusses on finding relationships in data and analyzing the processes for extracting such relations, rather than on building truly "intelligent systems". The vast range of domains in which machine learning is being applied, from engineering applications in robotics and pattern recognition (speech, handwriting, face recognition) to internet applications (text categorization) and medical applications (diagnosis, prognosis, drug discovery), is making this field expansive in scope and attracting researchers from different disciplinary domains. The ever-increasing role of machines in solving complex cognitive tasks has led scholars like Erik Brynjolfsson [1] to argue that "machines that can complete cognitive tasks are even more important than machines that can accomplish physical ones".
As this brief overview shows, capturing the intellectual structure and dynamics of this field is an important exercise. The interdisciplinary nature and diverse application domains of the field make it a challenging one. One interesting line of investigation is the bibliometric approach. It is essentially the application of quantitative and statistical methods to items such as research papers and patents, with research papers taken as a 'proxy indicator' of research activity and patents as a proxy indicator of inventive activity. The rich body of scholarly work that applies this approach to capture the intellectual structure of a research area provides the rationale for using it in this study (see for example Mingers and Leydesdorff, 2015). [2] The study examines the broad trends of research and invention in machine learning through research papers and patents.
A more detailed analysis of research activity in 2018 was done to uncover contemporary research in this area. The intellectual structure of the field was discerned by using Social Network Analysis (SNA) as a conceptual framework and applying the tools and techniques constructed from this framework.
Some recent studies have investigated machine learning and related subdomains through bibliometrics and provide a useful context for the present study. Rincon-Patino et al. [3] applied CiteSpace, a bibliometric mapping software, to capture the research trends in machine learning for the period 2007-2017. The study applies co-authorship analysis and keyword-based co-word analysis to uncover some aspects of the research structure. The paper is published in F1000Research and was accepted with the observation, among others, that its search query was not properly constructed. This limits how far its findings can be related to the present study. More specific studies within this field can also be observed. Fu and Aliferis [4] examined the feasibility of predicting future citation counts in biomedical literature through a mixture of content-based and bibliometric features using machine learning methods. Mao et al. [5] examined the status of deep learning research through bibliometric analysis. The categorisation of research fronts in that study provides a good assessment of the area's linkages with related fields. Yu [6] explored research activity in the support vector machine (SVM), a highly influential supervised machine learning algorithm, with particular reference to China. VOSviewer-based mapping software was applied to explore the trends in this field. These studies provide useful insights into the methodological approach, current trends and, to some extent, how the machine learning research field is taking shape.
The literature review also closely examined the different tools and techniques that help to map the intellectual structure of a research field. In this regard, the detailed analysis of various mapping techniques by Cobo et al. [7] provided very useful insights. Various concepts, tools and techniques of SNA were also explored to enrich the study. [8]

METHODOLOGY
Social Network Analysis (SNA) was used as a conceptual framework as well as an analytical tool for data analysis. It helps to characterise and draw meaning from network structure in terms of nodes (individual actors, things within a network, etc.) and the ties or edges (defined as relationships or interactions) that connect them. Social structures of various types can be investigated through the use of network and graph theories (see for example Wasserman and Faust, 1994). [9] 'Centrality' is an important concept in SNA, as it reveals the structure of a network by measuring linkages among actors in the network. There are different kinds of centrality measures to capture the network structure. Four centrality measures (degree, closeness, eigenvector and betweenness centrality) were used to draw meaning from the network structure. Degree centrality equals the number of ties that a vertex has with other vertices. Generally, vertices with higher degree, i.e. more connections, are more central and tend to have a greater capacity to influence others. Closeness centrality emphasizes the distance of a vertex to all other vertices in the network by focusing on the geodesic distance from each vertex to all others. Betweenness centrality is based on the number of shortest paths passing through a vertex. Vertices with high betweenness play the role of connecting different groups. Eigenvector centrality measures the influence of a node in a network: a node scores highly when it is connected to other nodes that themselves score highly.
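The four centrality measures described above can be illustrated with a minimal sketch in plain Python. The toy keyword network, the normalisation choices and all function names below are assumptions made for illustration only; they are not the study's data nor the routines implemented in UCINET or Pajek.

```python
from collections import deque

def degree_centrality(adj):
    """Number of ties of each vertex, divided by the maximum possible (n - 1)."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def bfs_distances(adj, source):
    """Geodesic (shortest-path) distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def closeness_centrality(adj):
    """(n - 1) over the sum of geodesic distances; assumes a connected graph."""
    n = len(adj)
    result = {}
    for v in adj:
        total = sum(bfs_distances(adj, v).values())
        result[v] = (n - 1) / total if total else 0.0
    return result

def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph (unnormalised)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # every unordered pair was visited from both endpoints
    return {v: b / 2 for v, b in score.items()}

def eigenvector_centrality(adj, iterations=200):
    """Power iteration on A + I; the identity shift avoids oscillation on
    bipartite graphs and leaves the leading eigenvector unchanged."""
    x = {v: 1.0 for v in adj}
    for _ in range(iterations):
        nxt = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        norm = sum(val * val for val in nxt.values()) ** 0.5
        x = {v: val / norm for v, val in nxt.items()}
    return x

# Hypothetical keyword network (undirected), not taken from the study
adj = {
    "machine learning": {"deep learning", "classification", "svm"},
    "deep learning": {"machine learning", "cnn"},
    "classification": {"machine learning", "svm"},
    "svm": {"machine learning", "classification"},
    "cnn": {"deep learning"},
}

degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
betweenness = betweenness_centrality(adj)
eigenvector = eigenvector_centrality(adj)

for keyword in adj:
    print(keyword, degree[keyword], round(closeness[keyword], 3),
          betweenness[keyword], round(eigenvector[keyword], 3))
```

In this toy network 'machine learning' leads on every measure, while 'deep learning' has a modest degree but a high betweenness because it is the only bridge to 'cnn'. This is the kind of contrast between measures that the analysis relies on.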
The Web of Science (WoS) database, covering papers from the Science Citation Index as well as the Social Science Citation Index, was used for capturing the research activity in this field. The search string ["machine learning or supervised learning or unsupervised learning or reinforcement learning"] was applied to Topic Search, which searches for and extracts all research papers that have this string in the title, abstract, authors' keywords or Keywords Plus (index terms assigned to papers by WoS). Only research articles and review papers in English were taken. Data was taken from 2010 onwards, with detailed analysis undertaken for 2018. A total of 10,372 papers from 2018 were retrieved using our search strategy; these were downloaded and used for detailed analysis.
Bibexcel software was used for preparing the bibliometric data and creating descriptive statistics. It was also used for constructing a co-occurrence matrix of keywords that could be read by the network analysis software used in this study, namely UCINET, NetDraw, VOSviewer and Pajek. The co-occurrence matrix was used for the co-word analysis. Co-word analysis is based on the premise that if two keywords occur together in a paper, the intensity of their co-occurrence is an indication of the relationship between the topics/constructs [10]; it helps to identify the conceptual structure and the main concepts treated by the field. [11] Patent data was extracted from the Web of Science-Derwent Innovation Index. Year-wise patent data from 2010 onwards was taken, with broad statistics captured for 2018.
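The construction of a keyword co-occurrence matrix can be sketched as follows. This is a minimal illustration in plain Python and not Bibexcel's actual procedure; the three sample records and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: one keyword list per paper (not the study's data)
papers = [
    ["Machine Learning", "Deep Learning", "Classification"],
    ["machine learning", "classification", "random forest"],
    ["deep learning", "machine learning"],
]

def cooccurrence_counts(keyword_lists):
    """Count, for every unordered keyword pair, the papers containing both."""
    counts = Counter()
    for keywords in keyword_lists:
        # normalise case and drop duplicates within a single paper
        unique = sorted({k.strip().lower() for k in keywords})
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts

def to_matrix(counts):
    """Expand pair counts into a symmetric matrix with a fixed label order."""
    labels = sorted({k for pair in counts for k in pair})
    index = {k: i for i, k in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for (a, b), c in counts.items():
        matrix[index[a]][index[b]] = c
        matrix[index[b]][index[a]] = c
    return labels, matrix

counts = cooccurrence_counts(papers)
labels, matrix = to_matrix(counts)

print(labels)
for row in matrix:
    print(row)
```

Each off-diagonal cell holds the number of papers in which the two keywords appear together. The diagonal is left at zero in this sketch, since a keyword's co-occurrence with itself carries no relational information.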

Broad Trends of Research and Inventive Activity
Using papers and patents, we examine how research and inventive activity is developing in this field (Figure 1).
Papers and patents follow an exponential trend, which is a reflection of the increasing research and inventive activity in this field. The data covers the period 2010 to 2018 and is forecast to 2022. The exponential trend is an indication of the field expanding in scope in terms of diversity and areas of application. A more detailed analysis of research and inventive activity in 2018 is discussed in later sections. One can observe that the USA and China are the two dominant players; together they account for 57% of the total papers in 2018 and 50% in 2012. The growth rate between 2012 and 2018 is very high for all the countries, with Russia, India, South Korea and China exhibiting the most significant increase.
The 10,372 papers on machine learning in 2018 were delineated on the basis of the subject categories assigned by Web of Science. Figure 3 highlights the prominent subject categories in which research is happening in this area. It may be noted that a paper can be categorised in more than one subject area, which can lead to double counting.
It is not surprising that 'computer science' and 'engineering' are the two prominent subject areas of research activity. However, at this broad level of delineation it is not possible to uncover the topics within which active research is happening. The study has therefore created more informed maps, presented in later sections, to provide a fine-grained structure of research activity.
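An exponential trend of this kind is commonly fitted by ordinary least squares on the logarithm of the yearly counts and then extrapolated. The sketch below shows that procedure under stated assumptions: the yearly counts are synthetic values generated for illustration (30% annual growth from 1,000), not the study's data, and the paper does not state which fitting method was actually used.

```python
import math

def fit_exponential(years, counts):
    """Least-squares fit of ln(count) = ln(a) + b * (year - years[0])."""
    xs = [y - years[0] for y in years]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # continuous growth rate per year
    a = math.exp(mean_y - b * mean_x)   # fitted count in the base year
    return a, b

def forecast(a, b, base_year, year):
    """Extrapolate the fitted curve to a given year."""
    return a * math.exp(b * (year - base_year))

# Synthetic illustration only: NOT the paper or patent counts of the study
years = list(range(2010, 2019))
counts = [round(1000 * 1.3 ** (y - 2010)) for y in years]

a, b = fit_exponential(years, counts)
annual_growth = math.exp(b) - 1

print(f"estimated annual growth: {annual_growth:.1%}")
for year in range(2019, 2023):
    print(year, round(forecast(a, b, years[0], year)))
```

Fitting on the log scale treats relative deviations in early and late years alike. A forecast four years beyond nine data points should be read as an indication of the trend rather than as a prediction.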
Some insights of patenting activity can be observed from Table 1 and Figure 4.
It is not surprising to see from the patent assignees that ICT companies dominate patenting activity in this area. One can, however, see a diverse range of focus areas among the ICT companies that are patenting. Machine learning provides enhanced system capability in hardware, physical networking and e-commerce. Social networking is now another active area where algorithms are embedding machine learning. Big ICT MNCs from the USA are the dominant players. A significant finding, however, is that Chinese companies are actively patenting in this area and displacing traditional technology leaders from Japan and Germany. Figure 4 shows the prominent areas in which the patents in 2018 are positioned. It may be noted that a patent can be categorized in more than one area; thus, if a patent falls in more than one area, it is double counted.

Capturing the Intellectual Structure of the Research Field
The intellectual structure of the field was constructed from the 10,372 papers on 'machine learning' in 2018.
The most frequently occurring keywords in 2018 are shown in Table 2. Three types of learning frameworks are the most preferred keywords, namely 'deep learning', 'support vector machine' (a component of supervised learning) and 'reinforcement learning'. The largest number of research papers carrying 'deep learning' indicates further maturity of the field because, unlike typical machine learning models, a deep learning model learns the relevant features by itself. New systems are exploiting 'deep learning' as it provides more capabilities; deep learning is being embedded, for example, in automatic car driving systems. Keywords related to training the system in various ways form the next core of frequently occurring keywords. The strong linkage with computational statistics is also exhibited among the frequently occurring keywords. Artificial Intelligence (AI) may be broadly defined as the ability of a computer program to function like a human brain. Machine learning and deep learning are both, in a sense, subsets of AI. Thus, the prominent presence of AI in machine learning papers underscores that machine learning is one of the key pathways for developing an artificially intelligent system.
Another interesting dimension is observed from the centrality measures. Machine learning is the dominant keyword and defines the network, which is unsurprising. The further importance of Deep Learning can be observed from the centrality measures. It has high values on all of them, showing that it is a key connector to many papers (degree centrality) and also acts as a bridge (betweenness centrality) connecting different domains of research within machine learning. It also has a high eigenvector value, showing that it connects to keywords that are themselves prominent in the network. Another important keyword is 'classification'. In supervised learning, this term denotes the task of classifying the input data. Its frequent occurrence together with its high centrality values indicates its importance in developing proper algorithms for machine learning. There are a number of classification algorithms, including 'Random Forest'. The high betweenness centrality of this topic shows that this classification algorithm connects different aspects of the learning framework.
To get further insights into the structure of the research field, co-word analysis was undertaken based on the indexed keywords (keywords assigned by WoS) of each article. Figure 5 is the projection of the co-occurrence relationships among the top 87 frequently occurring keywords. The cut-off was initially set at the 100 most frequently occurring keywords; close examination showed that many of these were variants of the same term. The variants were merged, which reduced the network to the relationships among 87 keywords. The units (keywords) are the nodes, and a relation between two keywords is represented by an edge between them. Mapping was done using VOSviewer [12], which builds a similarity matrix from a co-occurrence matrix. [13] The mapping technique then essentially minimizes a weighted sum of squared Euclidean distances between all pairs of items through an optimization process. The structure of linkages was further understood by applying the VOS clustering technique, which is related to modularity-based clustering. [12] This algorithm generated seven clusters. Membership of a cluster is indicated by colour, i.e. all points with the same colour are members of the same cluster. Table 3 allows us to look at the cluster structure with more clarity. To identify the representative keyword defining each cluster, we captured the degree centrality of each keyword; the keyword with the highest degree centrality was identified as the representative keyword. This approach provides a novel method for identifying a representative keyword. Further indicators of centrality were then computed for each representative keyword. Table 3 presents this analysis.
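The mapping step described above can be written out as follows. This is a sketch based on the general published description of the VOS technique rather than on anything stated in this study; the symbols are assumptions introduced here: c_ij is the number of co-occurrences of keywords i and j, w_i is the total number of occurrences (or co-occurrences) of keyword i, x_i is the location of keyword i on the map, and n is the number of keywords.

```latex
% Similarity (association strength) derived from the co-occurrence matrix
s_{ij} = \frac{c_{ij}}{w_i \, w_j}

% Map locations: minimise the similarity-weighted sum of squared distances
\min_{x_1, \dots, x_n} \; V(x_1, \dots, x_n)
  = \sum_{i<j} s_{ij} \, \lVert x_i - x_j \rVert^{2}
\quad \text{subject to} \quad
\frac{2}{n(n-1)} \sum_{i<j} \lVert x_i - x_j \rVert = 1
```

The constraint fixes the average distance between items at one, which rules out the trivial solution in which all keywords are placed at the same point. Strongly related keywords (large s_ij) are therefore pulled close together, while weakly related ones are pushed apart.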
Figure 5 and Table 3 allow us to uncover insights into the intellectual structure of the research field of 'machine learning' in 2018. The representative keywords provide good insight into the intellectual delineation of the field. Support Vector Machine, Machine Learning, Classification, Feature Extraction, Big Data, Feature Selection, Natural Language Processing and Deep Learning are the representative keywords of the seven clusters. As Machine Learning is the core of the whole network, a second keyword, Classification, was also used as a representative keyword for cluster 2. Degree centrality was used as the measure to identify the representative keyword. Further measures of centrality were also taken for each representative keyword to gain more informed insights.
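The selection rule described above, the keyword with the highest degree centrality across the whole network within each cluster, can be sketched as follows. The edges, the cluster assignments, the alphabetical tie-break and the `exclude` option (used here to mimic adding a second keyword to the cluster that contains the network core) are illustrative assumptions, not the study's data or exact procedure.

```python
from collections import defaultdict

def representative_keywords(edges, clusters, exclude=()):
    """Return {cluster id: keyword with the highest degree in the whole network}.

    edges    -- iterable of (keyword_a, keyword_b) co-occurrence links
    clusters -- dict mapping each keyword to its cluster id
    exclude  -- keywords that should not be chosen as representatives
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    members = defaultdict(list)
    for keyword, cluster in clusters.items():
        if keyword not in exclude:
            members[cluster].append(keyword)

    # highest degree wins; ties are broken alphabetically for reproducibility
    return {
        cluster: min(kws, key=lambda k: (-len(neighbours[k]), k))
        for cluster, kws in members.items()
    }

# Hypothetical network and clustering, for illustration only
edges = [
    ("machine learning", "classification"),
    ("machine learning", "deep learning"),
    ("machine learning", "support vector machine"),
    ("classification", "support vector machine"),
    ("support vector machine", "kernel"),
    ("deep learning", "cnn"),
    ("deep learning", "image recognition"),
]
clusters = {
    "support vector machine": 1, "kernel": 1,
    "machine learning": 2, "classification": 2,
    "deep learning": 7, "cnn": 7, "image recognition": 7,
}

reps = representative_keywords(edges, clusters)
reps_without_core = representative_keywords(
    edges, clusters, exclude={"machine learning"}
)
print(reps)
print(reps_without_core)
```

Because degree is counted over the whole network rather than within the cluster, a representative keyword is one that is well connected beyond its own cluster, which is the property the study emphasises over raw occurrence frequency.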
We discussed earlier the importance of Deep Learning (cluster 7) and Classification (cluster 2) for machine learning systems. Their emergence as representative keywords of two clusters is not surprising. Each of the other representative keywords also strongly influences machine learning research. The support vector machine (SVM) is a widely used algorithm in the field of machine learning and a research hotspot in the field of data mining. [6] As the representative keyword of cluster 1, which has the largest number of closely linked topics, it delineates a very strong intellectual domain within machine learning research. Big data (cluster 4) is a term that describes large volumes of data, both structured and unstructured. It is typically described in terms of how it creates new opportunities for companies that can exploit huge volumes of data to improve customer services, increase supply chain efficiency, improve demand-driven operations and build better customer-supplier relationships. Research is increasingly driven by big data, leading to the so-called fourth paradigm of research, and big data analytics has thus become a hotspot of academic research.
Machine learning models have become very important in big data analytics, and our findings also demonstrate their research influence in this area. The high closeness centrality of this topic shows that it is very well connected across the whole domain of machine learning research. Feature extraction (cluster 3) is used to derive new, informative features from the original data set, for example to detect features such as shapes, edges or motion in a digital image or video, and is key to effective model construction for machine learning, pattern recognition and image processing. Feature extraction can be decomposed into two steps: feature construction and feature selection (cluster 5) (http://clopinet.com/fextract-book/IntroFS.pdf). Besides selecting relevant and informative features from the data, feature selection can also be instrumental in general data reduction, feature set reduction, performance improvement and data understanding. The two clusters of which Feature Extraction and Feature Selection are the representative keywords thus delineate a core domain of research activity in machine learning. Both also have high closeness centrality, indicating their connection across all the research topics within the field. Cluster 6 is represented by Natural Language Processing (NLP), which is essentially concerned with the interaction between computers and humans using natural language. NLP borrows tools and techniques from machine learning to develop the capability and sophistication to read, decipher, understand and make sense of human languages in a manner that is valuable. Thus, the representative keywords and the topics within each cluster show the key areas within which machine learning research was happening in 2018.

DISCUSSION AND CONCLUSION
The study draws attention to the salient aspects of research and inventive activity in machine learning by examining research papers and patents. A more detailed analysis was undertaken for research activity in 2018. The exponential trend of papers and patents highlights the importance of this area for science as well as technology. This is not surprising since, as the study points out, the area is expansive in scope in terms of application domains, highly interdisciplinary and demand driven. The USA and China dominate research activity, with computer science and engineering as the two major subject areas in which most machine learning research is happening.
Patent analysis shows the dominance of ICT companies, although the dispersion across areas within the field is high. These companies are also making a strong technology assertion in this area.
The fine-grained analysis of the intellectual structure of the research field from papers in 2018 highlights many salient aspects of the structure of the field. The study has been able to create an informed understanding of the structure by applying co-word analysis, using mapping software and drawing insights from the concepts, tools and techniques of Social Network Analysis (SNA). Many studies do not exploit the conceptual understanding of SNA and restrict themselves to the mapping exercise. The present research has drawn attention to 'centrality', a key concept in SNA, to derive meaning from the frequency of topics as well as the linkages among key topics. Seven distinct clusters could be identified, with a representative keyword identified in each cluster based on degree centrality. A representative keyword in a cluster was identified as the one that had the maximum linkages across the whole network. We argue that this is a more robust method of identifying the dominant keyword of a cluster than the typical assignment based on high occurrence frequency. A close examination of research in this field showed that each of the clusters delineated a key research area within the field. Topics in a cluster, however, are not disconnected from topics outside the cluster.
The study is expected to be useful for a wide cross-section of researchers not restricted only to bibliometrics research.