Relevance of Innovations in Machine Learning to Scientometrics

Machine learning envisages building models that classify, predict, cluster or determine the relative relevance of features to a problem and the associations between them. This paper briefly describes how these tasks are relevant to Scientometrics. Through this brief survey of selected tasks, it is observed that most solution approaches in the Scientometric literature are built on the strong foundation of understanding and debating influencing factors and the process of feature engineering, requiring the descriptors to be intuitive and the methods used for classification, prediction, etc., to be amenable to interpretation. Recent trends in machine learning, particularly deep learning methods, however, pose an interesting question: can we build models that automatically determine what features are important and thereby bypass the step of feature engineering? This paper discusses how such techniques could also be harnessed in Scientometrics.


INTRODUCTION
Most computational tasks in Scientometrics can be understood broadly to involve the design or application of features to gain an insight into the (relative) impact of innovation or research of institutions, scientists and avenues of knowledge dissemination (such as journals, conference proceedings, etc.). Eliciting relevant numerical descriptors for quantifying such impact seems to require a deep understanding of the influencing factors. For instance, the Hirsch number or h-index, a relatively popular measure for the citation index of an author, is computed as $h = \max_{k} \min(c(k), k)$, where $c(k)$ is the number of citations of the k-th publication, with publications listed in descending order of citations. [1] This measure does not favor a large number of poor quality publications or a very small number of highly cited articles. Similarly, the THE-QS World University Rankings ranks institutions based on academic prestige through features that quantify the quality of teaching, research, citations, international outlook, etc. [4][5] However, our interest in these measures is to note that the selection of features for such problems requires some knowledge of the domain.
Given the definition of each measure, automating their computation on relevant data is fairly straightforward. Indeed, it would appear that most tasks in Scientometrics involve the computation of scores from data, once the measures are defined. However, techniques borrowed from machine learning, such as classification, regression, clustering and association rule mining, prove to be useful in three kinds of situations. The first is prediction that builds on the insight gained from past data (for example, predicting the ranking of an institution at the end of the year based on the measures computed a few months into the year). The second is mining underlying patterns (for instance, what should an institution X focus on to improve its ranking in the coming years? What strategy should a publication employ to ensure articles in a particular area are read and cited?). The third comprises other descriptive tasks, such as grouping institutions, individuals or journals with a similar subset of parameters, or filtering publications in a certain research area.
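As an illustration of how straightforward such a computation is once the measure is defined, the h-index described earlier in this section can be sketched in a few lines. This is a minimal sketch: the function name `h_index` and the sample citation counts are assumptions made for illustration, not data from any cited study.

```python
def h_index(citations):
    """Return the maximum over k of min(c(k), k), where c(k) is the
    citation count of the k-th paper in descending order of citations."""
    ranked = sorted(citations, reverse=True)
    # enumerate from 1 so that k is the rank of the publication
    return max((min(c, k) for k, c in enumerate(ranked, start=1)), default=0)


# Hypothetical citation counts for three authors
print(h_index([10, 8, 5, 4, 3]))   # balanced record
print(h_index([100, 2, 1]))        # one highly cited article
print(h_index([1] * 50))           # many poorly cited articles
```

The last two calls mirror the property noted in the text: neither a single highly cited article nor a long list of rarely cited ones yields a large score.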
The subsequent section presents a brief survey of machine learning techniques used in Scientometrics, followed by a summary of the recent innovations in machine learning (particularly, the power of deep learning networks) and a discussion of their relevance to Scientometrics.

MACHINE LEARNING IN SCIENTOMETRICS
Machine Learning techniques can be broadly discussed under the heads of supervised and unsupervised learning. Supervised learning comprises tasks such as classification and prediction, for which models are designed from training data with target annotations. Unsupervised learning, or clustering, deals with finding groups of similar data points; it does not presuppose any annotation of the training data. The goal of supervised learning is to arrive at a model that maps an input to its target output for the training set, replicating the logic behind the annotation process on test data and hence producing the expected results. In the case of unsupervised learning, the goal is to group similar data points or partition the feature space into natural groupings. Thus, besides the advantage of not requiring the data to be annotated (a tedious and time-consuming effort), there is scope to discover novel underlying patterns and gain an insight into the data through unsupervised learning. [6]

Classification
Machine learning for classification can be abstracted to a system that preprocesses the raw input, extracts features (and possibly selects a subset of relevant features) and creates a model based on the features. Preprocessing involves tasks such as cleaning the data to eliminate erroneous data, identifying missing data and outliers, and taking an appropriate action, such as interpolating missing data or eliminating anomalies, to render the data amenable to further tasks in the process pipeline. The process of creating a model or 'training the classifier' is iterative, and refinement is based on evaluating the model through cross-validation (presenting some of the annotated data, not used for training, to 'test' the system). Suitable adjustments to the model may be made to ensure that the errors are minimal, that there is no systemic 'bias' and that the model does not 'overfit' the training data. Multiple folds of validation help assess the significance of the average performance, i.e., how repeatable the performance on one set of validation data is. [7]
An example of a classification task in Scientometrics could be automatically classifying the category of a citation to be able to retrieve the most relevant or useful references when required. Garzone and Mercer start with the observation that a citation has multiple purposes, such as paying homage to predecessors, acknowledging the use of some equipment or technique, questioning or agreeing with some result, etc. They present a rule-based classifier to label 35 categories of citations. [8] Another example is the related task of determining the polarity of a citation (positive or negative). This has been accomplished using linguistic features that describe the context of the reference. [9] It is noteworthy that the solution approaches to these tasks hinge on the choice of meaningful features that capture the essence of the relevant content, as well as the careful design of rules, which happen to be intuitive and easy to interpret, to achieve a meaningful outcome.
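The idea of evaluating a classifier over multiple folds of validation can be sketched as follows. This is only an illustrative sketch: the helper names (`k_fold_indices`, `train_majority`, `cross_validate`), the toy polarity labels and the trivial majority-label 'model' are all assumptions, standing in for whatever features and classifier a real study would use.

```python
from collections import Counter


def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous, near-equal folds.
    Assumes n_folds <= n_samples so that no fold is empty."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_majority(train_labels):
    """Toy 'model': always predict the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validate(labels, n_folds):
    """Return the accuracy on each held-out fold."""
    accuracies = []
    for held_out in k_fold_indices(len(labels), n_folds):
        held = set(held_out)
        train = [y for i, y in enumerate(labels) if i not in held]
        prediction = train_majority(train)  # fitted without the held-out fold
        correct = sum(1 for i in held_out if labels[i] == prediction)
        accuracies.append(correct / len(held_out))
    return accuracies


# Hypothetical citation-polarity annotations
labels = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "pos", "neg"]
scores = cross_validate(labels, n_folds=3)
print(scores, sum(scores) / len(scores))
```

Reporting the spread of the per-fold scores alongside their mean is one simple way to describe how repeatable the performance is.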

Prediction
Prediction of a target variable is the task of building models, typically a weighted function of the current and/or 'past' data and other parameters that influence the outcome of the target variable, to forecast the value it might take on at a specified time in the future. [10] A popular approach to predicting values is to design regression models, whose output is a real number, as opposed to a category or a number signifying a class label as with classification. For instance, estimating the number of citations an article could be expected to receive, based on the number of authors, institutions, citations of individual authors, etc., is a predictive task. [11] A recent study has proposed the use of relevant features with neural networks to predict articles that would be highly cited. [12] Another interesting study has shown that the most cited articles (in Medical Research) can be predicted based on the number of tweets within the first three days of the article being published. [13] Quantile regression has been used to model the probability distribution of the future citation count of articles. [14] Through this it has been shown that the potential long-term impact of various articles can be predicted. Another study has shown that the keywords of the abstract (modeled using a bipartite graph) can be used to determine the number of citations in the future, with articles having higher citations conforming to the mainstream. [15] There have been a number of such studies that have compared the merits of various predictive models, most notably variants of regression. [16,17]

Feature Analysis
As we noted earlier, each of the foregoing methods requires an understanding of the influencing factors that are most relevant to the problem. A straightforward supervised approach to classification is the k-nearest neighbor method (abbreviated as KNN). This method matches a set of features of a sample that needs to be annotated with those of the samples in the training set. It then selects the label that is most common among the k most similar samples. [18] While prediction tasks are typically accomplished through studying correlations between features, more sophisticated techniques can be used and interesting questions can be asked when we automate the process of finding appropriate factors. For instance, what makes an article influential?
Multivariate analysis has been performed to elicit this information. [19] Likewise, associations between features can be studied to come up with recommendations. As an example, research collaborations have been suggested based on predicting the link between research centers doing similar work using random forest classifiers. [20] The Gini index has been used in this study to determine the relative importance of features in determining the recommendation. Another multivariate analysis technique for understanding the relative importance of features, traditionally used to arrive at a subset of weighted features that serve as strong predictors, is principal component analysis (abbreviated as PCA). [21,22]
Association rules are implications or bijections that describe the relationship between two features. Mining them seems to be one of the most natural methods to arrive at relationships between explanatory variables. Even though there is much to be explored with the mining of association rules, there have been a few examples of how insightful this can be. For instance, co-occurrence of keywords with authors has been used to mine for frequent patterns, resulting in association rules for authors and keywords or journals and keywords. [23]

Clustering
Clustering is a process of grouping data based on the similarity between the features (most methods seek to minimize inter-class similarity and maximize intra-class similarity). This is a popular approach in Scientometrics for two reasons: aggregating data helps summarize the results for data points that are similar, and there is no need for a large dataset of annotated data. Unlike the case of supervised learning, where class labels or category boundaries may require some justification, the patterns that emerge through clustering can be used to gain some insight. Two aspects need attention when using clustering. One is the choice of similarity (or dissimilarity) measure: how well does it capture the inherent relationship between features? The other is the choice of clustering algorithm. There are a number of approaches that can be used for clustering. The most popular approaches are agglomerative hierarchical clustering and DBSCAN. The former results in a dendrogram that makes for a neat visual representation. The latter, DBSCAN, is a density-based clustering method that takes into account the lack of homogeneity in the spread of data. An example of aggregation in Scientometrics is the clustering of journals and category labels at various levels. [24] For DBSCAN, an example is arriving at a paper recommendation based on the proximity of citations. [25]
Methods are rarely used in isolation; they are often used in combination. For instance, an interesting problem is that of tracking changes in trends. In particular, changes in patent citation networks (i.e., clusters) have been studied over time to describe growth (an increase in the number of citations), contraction (a reduction in the number of citations), merging and splitting of citation networks, and the birth and death of a network from an existing one. Subjective and objective measures have been combined for the task with the hope that the method identifies, for instance, the advent of new technological areas before the US Patent Office recognizes them. [26] Another multivariate model analyzes citation networks of articles to infer that, for a higher h-index, it is advisable to publish with a large number of co-authors, particularly those who have been highly cited. [27] For this, the authors consider a network of co-authors that is centered around an individual author ('ego-centric networks').
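The density-based clustering idea mentioned in this subsection can be made concrete with a compact sketch of DBSCAN. This is a simplified illustration under stated assumptions: the function name `dbscan`, the parameters `eps` and `min_pts`, and the two-dimensional toy points are chosen for illustration, and a neighborhood is taken to include the point itself. It is not the implementation used in any of the cited studies.

```python
import math


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # point i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: expand
    return labels


# Two dense groups and one isolated point (hypothetical 2-D features)
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))
```

Unlike methods that need the number of clusters in advance, this approach infers the groups from local density and leaves sparse points unassigned.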
Since the objective in Scientometrics is the choice of features and the interpretation of the outcome, rather than the design of a new heuristic or modeling approach in machine learning, few papers in the area have detailed explanations of the mathematical underpinnings of the methods used or of algorithmic details such as parameter tuning. A broad overview of Machine Learning methods and how they apply to Scientometrics can be found in [28].
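As one example of the kind of algorithmic detail that is usually left implicit, the k-nearest neighbor method described under Feature Analysis can be sketched as below. The function name `knn_predict`, the choice of Euclidean distance, the value of k and the two toy features (say, number of authors and mean prior citations of the authors) are assumptions made purely for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the most common label among the k training samples
    closest to the query (Euclidean distance)."""
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Hypothetical annotated articles: (feature 1, feature 2) -> citation class
train_points = [(1, 2), (2, 1), (2, 3), (8, 9), (9, 8), (9, 10)]
train_labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_points, train_labels, query=(8, 8)))
print(knn_predict(train_points, train_labels, query=(1, 1)))
```

The value of k and the distance measure are exactly the sort of parameters whose tuning is rarely reported, although both can change the predicted label for samples near a class boundary.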

INNOVATIONS IN MACHINE LEARNING
Most of the effort in Machine Learning has been focused on finding representations of data that are descriptive (for clustering or predictive analysis) or discriminative (for classification), understanding their interrelationships (correlation, multivariate analysis, association mining), arriving at meaningful subsets (principal component analysis), etc. These tasks presuppose an understanding of the domain and an ability to preprocess the data, followed by the design of meaningful features. An enormous innovation in machine learning has been to outsource the task of feature engineering to machines. Central to this innovation is the question: can machines determine, on their own, representations of the data that matter for a task? It turns out that this can be done remarkably well through deep learning, which is quickly evolving into a tool of choice across fields. [29] This has had a particularly high impact on problems in which the dimensionality of the original data is huge, the volume of data points is very large and it is prohibitively complex to manually comb through the data to annotate training samples and engineer meaningful features, such as the classification of over a million images belonging to over 1000 categories.
Deep learning has at its core an artificial neural network, the same idea used in the foregoing section to explain the principle of classification: a model that maps the input to a target label. The only difference is that a nonlinear function is applied to the weighted sum of the input features. While a single-layer neural network returns a nonlinear map of some weighted combination of the features, it has been explained that a network with multiple such layers partitions the feature space through arbitrary unions of finite intersections, representing different regions corresponding to the categories. [30] Building on that principle, when multiple nodes in each layer are stacked across multiple such layers (hence the name 'deep neural network'), the network manages to extract features that are relevant and, through multiple epochs of training, adjusts the weights assigned to these features to arrive at a meaningful decision. [31] In fact, it has been shown that such deep neural networks can outperform traditional feature selection and dimensionality reduction approaches such as PCA. [32]
Since the need for explicit feature engineering is obviated, attention has shifted to the different combinations of the number of layers in a deep learning network, the nonlinear functions used, the loss function based on which the weights are optimized, the choice of learning rate and the regularization procedures to overcome overfitting. Together with the mushrooming of off-the-shelf pre-trained models and computational tools to code deep learning architectures, this has led to an explosion of scientific papers on the theoretical aspects of deep learning and even more on the application of deep learning to solve problems in various fields. Some of the techniques and their progression have been summarized in various surveys. [33,34]
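The building block described above, a nonlinear function applied to a weighted sum of inputs and stacked over several layers, can be sketched as a forward pass. This is a minimal sketch with no training step: the sigmoid nonlinearity, the function names and the hand-picked weights are assumptions for illustration, whereas in practice the weights would be learned over multiple epochs.

```python
import math


def sigmoid(z):
    """A common choice of nonlinear function (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases):
    """One layer: each node applies the nonlinearity to its own
    weighted sum of the inputs plus a bias."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the inputs through a stack of (weights, biases) layers."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


# Hypothetical network: 3 inputs -> 2 hidden nodes -> 1 output node
layers = [
    ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1]),  # hidden layer
    ([[1.2, -0.7]], [0.05]),                              # output layer
]
print(forward([1.0, 0.5, -1.0], layers))
```

Each additional layer takes the previous layer's outputs as its features, which is how the network composes its own intermediate representations instead of relying on hand-engineered ones.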

Relevance of Deep Learning to Scientometrics
What role would deep learning have to play in Scientometrics, which relies on the design of features that can be understood and discussed? Since a lot of the earlier work using machine learning hinges on the analysis of content, deep learning allows such analysis to be carried out on even more content. For instance, when similarity groupings between journal articles were done, proximity measures had to be defined. Keywords of articles do not always match similar papers accurately, and extending the matches to keywords extracted from the content could be colored by the length of the paper, context, etc. These problems are circumvented through the use of language embedding models with deep learning. A word embedding, such as Word2Vec, converts every word to a d-dimensional vector (typically 100-300 dimensions have been found empirically to be useful), rendering words used in similar contexts more similar than words that are merely close in spelling. [38] Thus, words such as king and prince would have vector representations with a smaller distance between them than word pairs such as king and kind or prince and price.
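The notion of proximity between word vectors can be illustrated with cosine similarity. The three-dimensional vectors below are made up for illustration only (real embeddings would have the 100-300 dimensions mentioned above and would be learned from a corpus), and the function name `cosine_similarity` is likewise an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: values near 1 mean the
    vectors point in nearly the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, NOT produced by any trained model
embedding = {
    "king":   [0.90, 0.80, 0.10],
    "prince": [0.85, 0.75, 0.20],
    "kind":   [0.10, 0.20, 0.90],
}

sim_related = cosine_similarity(embedding["king"], embedding["prince"])
sim_spelling = cosine_similarity(embedding["king"], embedding["kind"])
print(sim_related > sim_spelling)
```

With embeddings learned from text, the same comparison could serve as the proximity measure between keywords, abstracts or citation contexts.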
Language models have been used to good effect for the semantic ranking of papers in PubMed. [39] Similarly, content can be studied for proximity between citation contexts using such language embedding models to arrive at more meaningful reference retrieval systems. Language models can also be used for sentiment analysis, to determine whether the reference context has a positive or negative connotation. [40] Deep learning can be used with unsupervised learning to group similar content (document clustering). [41] Further heuristics that can be interpreted, such as citations or frequently used keywords, can then be extracted from these groups. Node representations learned through deep learning architectures can be used to discover network communities within large domains of scientific publishing. [42]

Treading with Caution
For low-resource data, overfitting is a problem with large network architectures. Transfer learning has been found to be useful in such cases. [43] It remains to be seen if models built for tasks in other domains can be retrained with less effort for similar tasks in Scientometrics to achieve meaningful outcomes.
If neural networks were treated as a black box because their weights and partitioning of the feature space could not be interpreted, deep learning networks have proved to be a 'blacker' box, in that even the features are not easily amenable to interpretation. It has also been shown through applications that deep learning is prone to errors in adversarial settings. This has limited the use of deep learning in fields such as healthcare, where it is imperative for a computational model to be robust and 'transparent'. There has been some effort to address these limitations in recent times. [44]
The spotlight in recent times has also turned towards understanding metaheuristics for deep learning. What loss functions work better for an application? What assumptions on the data or loss function expedite convergence? How should the learning rate be selected to avoid local minima? Is it possible to use fewer epochs of training or smaller training sets to reach the performance otherwise achieved with vast amounts of high-dimensional data and training over several epochs? For instance, the Saha-Bora Activation Function (SBAF) has been used to explain the rise in ranking of the journal Astronomy and Computing over its predecessors. [45] Perhaps a close look at the theoretical underpinnings of the methods can lend a deeper insight into the features that matter and pave the way for harnessing the power of deep learning more effectively in Scientometrics in the future.

CONCLUSION
The field of Scientometrics has benefited from computational advancements in Machine Learning in the past. Some instances include the analysis of social media postings to forecast the citations a journal article might receive, analyzing the sentiment of a citation to determine if it has been used to strengthen an argument or rebut it, and analyzing the context of a citation to retrieve relevant references. We have also noted the complexity of designing heuristics to rank institutions or journals or to quantify the scientific impact of an author, as these are beset with some bias inherent to how the measure is defined. The explicit choice of influencing factors has led to debates about the relative importance of features and paved the way for new measures to evolve. The advances in Machine Learning in recent times, particularly deep learning, which obviates the need for explicit feature engineering, have proved to be most useful in other domains such as computer vision and linguistics in solving a plethora of problems considered computationally intractable earlier. Given that, by design, deep learning takes away the transparency of features, it remains to be seen how the community will take to adopting these methods for Scientometrics. While the limitations are obvious, it can be argued that such methods eliminate human biases. As suggested by empirical evidence, computational methods that require little intervention can be used to explain perplexing trends in the data. However, the choice of these methods would require a deep insight into their workings and foundations. It seems possible that the right use of deep learning methods may even lend some new insights in Scientometrics.