Research Fronts of Computer Science: A Scientometric Analysis

Computer science and technology have developed rapidly in the past few decades, and the community has shown an increasing tendency toward interdisciplinary research. Research fronts of Computer Science (CS) have attracted the attention of scientists from different backgrounds, for whom discovering the development trends is a major challenge. This study uses scientometric methods and a combination of macro and micro analysis to detect the research fronts of CS, based on data from the Scopus and SciVal databases. The macro analysis focuses on leading countries and institutions by scholarly output and citation count. The micro analysis examines the performance of institutions and their competitors in research fronts and helps researchers understand the frontier topics of a specific research field. This paper provides a comprehensive and finer-grained analysis of the research frontier topics of the CS domain. The insights obtained from the analysis are intended for both researchers and policy makers.

• What is the development profile of a frontier topic?

This paper is organized as follows. In "Related work", we discuss previous work on research front detection in CS. Section "Methodology and data" introduces the methodology followed and the data used. Section "Results and discussion" describes the analysis of research fronts in CS and several case studies at the macro and micro levels. Section "Conclusion" presents a summary of the work.

RELATED WORK
Scientists have proposed various methods to detect research fronts, which can be divided into two approaches: citation-based and term-based. [8] Research fronts can also be expressed by the emergence of new topics or changes in the relationship between keywords. In CS, several studies have analyzed topic trends or research fronts. Tattershall et al. [9] applied a stock market-inspired burst detection methodology to the free text of a large corpus of CS abstracts gathered from DBLP. A total of 2.6 million articles from 1988 to 2017 were used to detect the popularity of research topics. Topics such as "deep learning", "word embedding" and "fog computing" appeared in the list of top bursty terms and were expected to rise in popularity. Wu et al. [10] studied research topic trends by analyzing the evolution of the topics worked on by authors with an uninterrupted and continuous presence. They found that the community showed an increasing tendency toward interdisciplinary research. Hoonlor et al. [11] used bursty word detection to study trends in CS. In addition, text-based analysis has been applied to identify the major CS research topics of particular countries. Uddin et al. [12,13] identified topic trends in Mexico and the SAARC countries through keyword frequency. Based on the research output of the 100 most productive institutions in India and worldwide over different time periods, Singh et al. [14] applied a burst detection algorithm to analyze research topics.
The studies above used keyword frequency or burst words to detect topic trends or frontier topics. However, a single keyword often cannot fully express the meaning of a research front. Burst words such as "artificial intelligence", "semantic web" and "data mining" cover a wide range and are not precise enough to represent a specific research front. It is also difficult to determine an institution's involvement in research fronts and to evaluate the relationship between its research direction and those fronts.
Our work builds on the direct citation method, which has been shown to detect large and young emerging clusters earlier and to perform better in detecting research fronts. [15] Unlike previous methods that use a single keyword, our work uses three keyphrases simultaneously to express a topic, which represents the topic more accurately. In addition, the leading countries and institutions in each frontier topic and the performance of institutions and their competitors in research fronts are discussed. As far as we know, our work is the first to provide a comprehensive and finer-grained analysis of research fronts in CS.

METHODOLOGY AND DATA
Subject classification is based on the All Science Journal Classification (ASJC) system of the Scopus database, in which CS is one of 27 major subject areas. This study uses SciVal, a scientific analysis tool, to generate the research topics of CS. SciVal is based upon a direct citation analysis of 75 million publications in the Scopus database from 1996 onward. Topics were generated by clustering the direct citation links. The prominence of each topic, which is an indicator of the momentum of a particular field, is calculated from citation count, views count and journal impact, based on the literature data for 2017-2019. Research fronts are obtained by setting a threshold on the prominence value. The data were collected on October 30, 2020. The analysis schema is depicted in Figure 1.

Creation of topics
The general process of direct citation clustering is as follows. First, a list of citing-cited pairs is created. Each pair is assigned a weight based on the link relationship: the weight a_ij between papers i and j is set to 1/k, where k is the number of edges for paper j. [16] Then, papers are assigned to clusters via the VOS algorithm, which uses a variant of modularity-based clustering and attempts to maximize the ratio of links within clusters to links between clusters. [17] Each topic consists of a set of publications with a common, focused intellectual interest, and a publication can belong to only one topic.
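The weighting step can be illustrated with a minimal sketch. It assumes that "the number of edges for the paper j" means the total number of citation links (incoming and outgoing) touching paper j; the function name and the toy data are hypothetical, and the VOS clustering step itself is not reproduced here.

```python
from collections import Counter


def direct_citation_weights(pairs):
    """Map each citing-cited pair (i, j) to the weight a_ij = 1 / k_j.

    Assumption: k_j is the total number of citation links (incoming plus
    outgoing) that touch paper j in the pair list.
    """
    unique_pairs = sorted(set(pairs))  # ignore duplicate links
    degree = Counter()
    for citing, cited in unique_pairs:
        degree[citing] += 1
        degree[cited] += 1
    return {(i, j): 1.0 / degree[j] for i, j in unique_pairs}


# Toy data: papers A and B cite C, and C cites D.
pairs = [("A", "C"), ("B", "C"), ("C", "D")]
weights = direct_citation_weights(pairs)
for pair, weight in weights.items():
    print(pair, round(weight, 3))
```

In this toy graph, paper C has three links, so each link pointing to C receives a weight of 1/3, while the single link to D receives a weight of 1. The weighted pairs would then be passed to a clustering algorithm.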

Generation of topic names
Each topic is named using keyphrases extracted by the Elsevier Fingerprint Engine, which applies Natural Language Processing techniques to mine the text of titles, abstracts and keywords in the literature. Each topic name consists of three distinctive keyphrases. The first two are generally high-frequency keyphrases, selected to provide a macro-level description of the topic within the research field. The third keyphrase is a more specific description of the topic.

Calculation of topic prominence
Prominence is calculated by combining the recent citation count, the recent Scopus views count and the CiteScore value. The Scopus views count is the sum of abstract views and clicks on the link to view the full text. CiteScore is an indicator of the academic influence of a journal. To calculate prominence, the following variables are considered for each topic and year n: [16]
• Citation count in year n to papers published in n and n-1;
• Scopus views count in year n to papers published in n and n-1;
• Average CiteScore for year n.
According to the analysis of Klavans and Boyack, [16] the three variables (citation count, views count, CiteScore) are strongly correlated. Through a three-variable factor analysis, the normalized factor scores are calculated as 0.495, 0.391 and 0.114, respectively, which represent the weight of each variable. The prominence of topic j in year n is then calculated by the following equation:
$$P_j = 0.495\,\frac{C_j - \mathrm{mean}(C_j)}{\mathrm{stdev}(C_j)} + 0.391\,\frac{V_j - \mathrm{mean}(V_j)}{\mathrm{stdev}(V_j)} + 0.114\,\frac{CS_j - \mathrm{mean}(CS_j)}{\mathrm{stdev}(CS_j)}$$
where c_j is the citation count to articles in cluster j published in years n and n-1, v_j is the Scopus views count to articles in cluster j published in years n and n-1, and cs_j is the average CiteScore for articles in cluster j published in year n. These raw values are log-transformed into the values used in the formula as C_j = ln(c_j + 1), V_j = ln(v_j + 1) and CS_j = ln(cs_j + 1).
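The calculation can be sketched in a few lines of Python. The sketch assumes that each log-transformed variable is standardized (z-scored) across all topics before the weights 0.495, 0.391 and 0.114 are applied, following the formulation of Klavans and Boyack. [16] The choice of the population standard deviation, the function name and the toy input values are illustrative assumptions, not data from this study.

```python
import math
from statistics import mean, pstdev

# Weights for citation count, views count and CiteScore, as given in the text.
WEIGHTS = (0.495, 0.391, 0.114)


def prominence(topics):
    """topics: {name: (citations, views, avg_citescore)} -> {name: score}."""
    names = list(topics)
    # Log-transform each raw variable: X = ln(x + 1).
    columns = [[math.log(topics[n][k] + 1) for n in names] for k in range(3)]
    scores = dict.fromkeys(names, 0.0)
    for weight, column in zip(WEIGHTS, columns):
        mu, sigma = mean(column), pstdev(column)
        for name, value in zip(names, column):
            # Assumption: standardize across topics, then apply the weight.
            scores[name] += weight * (value - mu) / sigma if sigma else 0.0
    return scores


# Hypothetical toy values, for illustration only.
toy_topics = {
    "topic_a": (1200, 30000, 4.2),
    "topic_b": (300, 9000, 2.1),
    "topic_c": (40, 1500, 1.0),
}
scores = prominence(toy_topics)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

Because the score is a weighted sum of standardized values, it is relative: a topic's prominence depends on how it compares with all other topics in the same year, not on its raw counts alone.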

Selection of research fronts
Topics are sorted by prominence and the percentile of each topic is then calculated. The higher the prominence percentile, the more attention the topic receives and the stronger its growth momentum. Because prominence is based on the citation and views counts of publications from the past two years, it reflects both high attention and novelty, and can therefore represent research fronts. Following previous practice, topics with a prominence percentile greater than 90 are considered research fronts, while those greater than 99 are considered hot research fronts. [18,19]
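The selection rule can be sketched as follows. The exact percentile definition used by SciVal is not given in the text, so the sketch assumes a simple rank-based percentile (rank divided by the number of topics, with ties ignored); the function names and the synthetic scores are hypothetical.

```python
def prominence_percentiles(scores):
    """Rank topics by prominence and return {name: percentile in (0, 100]}."""
    ranked = sorted(scores, key=scores.get)  # lowest prominence first
    total = len(ranked)
    return {name: 100.0 * (rank + 1) / total for rank, name in enumerate(ranked)}


def classify(percentile):
    """Apply the thresholds from the text: >99 hot front, >90 research front."""
    if percentile > 99:
        return "hot research front"
    if percentile > 90:
        return "research front"
    return "other"


# Synthetic example: 200 topics with increasing prominence scores.
example_scores = {f"T.{i}": float(i) for i in range(200)}
labels = {
    name: classify(p)
    for name, p in prominence_percentiles(example_scores).items()
}
print(sum(label == "hot research front" for label in labels.values()))
print(sum(label == "research front" for label in labels.values()))
```

With 200 synthetic topics, only the top two exceed the 99th percentile, and a further 18 fall between the 90th and 99th percentiles.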

RESULTS AND DISCUSSION
Q1: What are research fronts or hotspots in CS?
CS covers 15,460 topics, accounting for 16% of all topics, including more than 500 research frontier topics (prominence percentile>90) and 136 hot research frontier topics (prominence percentile>99). Table 1 lists the 20 research frontier topics with the highest prominence percentiles. They cover popular fields such as deep learning, blockchain, natural language processing, recommendation systems and the Internet of Things. The topics listed provide a more granular portfolio analysis of a research field; for example, hot topics related to natural language processing include named entity recognition and product viewpoint mining.
Among the top 500 frontier topics, scholarly output ranges from 680 to 19,030, with a median of 1048. Table 1 shows that the scholarly output of almost all of the top 20 topics exceeded 2000 during 2017-2019. The exception is the topic "Interatomic Potential; Potential Energy Surface; Material Science", which is a cross-disciplinary topic mainly about the application of machine learning in materials science. Journals in materials science often have a high CiteScore, so this topic also has a high prominence value. A previous study showed a moderate positive correlation between the number of publications and the prominence ranking of topics; [18] that is, the more publications a topic has, the higher its prominence value tends to be. At the same time, the Field-Weighted Citation Impact (FWCI, a normalized impact indicator for which a value above 1.0 indicates that publications have been cited more than the global average for similar publications) of these topics is above 1 in all cases and above 2 in some, indicating that these topics also have a higher influence.
Figure 2 shows the top 100 frontier topics by scholarly output in CS. Each bubble represents a topic, and the size of the bubble indicates the scholarly output of the topic. The position of the bubble is determined by the ASJC subject. The closer the bubble is to the center, the more multidisciplinary the topic is. For example, the biggest bubble is "Object Detection; CNN; IOU", which has the most scholarly output. The topic "Exome; Copy Number Variation; Whole Genome Sequencing" belongs to bioinformatics and shows a strong multidisciplinary character.
Q2: Which countries and institutions are leading in research frontier topics of CS?
Beijing University of Posts and Telecommunications ranks first in the academic sector with 12 hot research frontiers, followed by CNRS with 7 leading topics. A total of 11 institutions have more than three leading topics (Figure 6). By citation count, Chinese Academy of Sciences leads in 7 hot topics and Alphabet Inc. leads in 5, followed by Beihang University, Harvard University and University of California at Berkeley with 4 leading topics each (Figure 7). The leading position of Chinese Academy of Sciences is less pronounced by citation count than by scholarly output.
Q3: Which research frontier topics are our institutions and the competitors currently active in? - A case study
This study takes Southwest Jiaotong University as an example, with Beijing Jiaotong University selected as a benchmarking institution. This question focuses on the topics in which the two universities are key contributors. An institution is considered a key contributor to a topic if it has at least 1/3 as many papers (or 1/3 as many citations) as the top publishing institution (or the top cited institution) in that topic. This study selects 6 topics in which the two universities are key contributors and which have a large scholarly output in this domain. Table 2 and Table 3 show that the prominence percentiles of the most productive topics of the two universities are all greater than 90, indicating that their main research directions are current frontier topics. Beijing Jiaotong University mostly concentrates on the field of ground transportation. By scholarly output, it ranks first in topics such as scheduling, traffic flow prediction and pedestrian evacuation. Southwest Jiaotong University mainly contributes to fault diagnosis, magnetic levitation, traffic scheduling and rough sets, and ranks first in incomplete information systems.
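The key-contributor rule can be expressed as a short sketch. It assumes that an institution qualifies when it meets the one-third threshold on either papers or citations within a single topic; the function name, institution names and counts are hypothetical and are not taken from Table 2 or Table 3.

```python
def key_contributors(papers, citations):
    """Return institutions with at least 1/3 of the top paper count
    or at least 1/3 of the top citation count in one topic."""
    top_papers = max(papers.values())
    top_citations = max(citations.values())
    return sorted(
        inst
        for inst in papers
        # Integer comparison (3 * x >= top) avoids floating-point rounding.
        if 3 * papers[inst] >= top_papers
        or 3 * citations.get(inst, 0) >= top_citations
    )


# Hypothetical counts for a single topic.
papers = {"Inst A": 90, "Inst B": 30, "Inst C": 10, "Inst D": 5}
citations = {"Inst A": 600, "Inst B": 100, "Inst C": 450, "Inst D": 50}
print(key_contributors(papers, citations))
```

Here "Inst B" qualifies through its paper count and "Inst C" through its citation count, while "Inst D" meets neither threshold.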

Q4: What are frontier topics in a research field? - A case study
Analyzing the distribution of publications in a research field over time can show how a research problem develops and changes, and can also help identify key nodes and emerging trends. In this study, we take Natural Language Processing (NLP) as an example to explore the key nodes and emerging trends of a research problem. NLP is an important direction in CS and is regarded as one of the core AI-complete problems. There are five hot topics related to NLP, as shown in Table 4.
Figure 8 shows the trends of these five topics over the past 10 years; all have increased gradually. The two topics with the most scholarly output are T.108 and T.1614, each with more than 2000 publications in 2019. T.108 is about sentiment analysis and opinion mining, which is the computational study of people's opinions, appraisals, attitudes and emotions toward entities, individuals, issues, events, topics and their attributes. [20] A large number of comments have accumulated on e-commerce sites. This information is closely related to people's daily lives and is widely used by consumers and business organizations. When consumers buy a product or service, they generally refer to the comments of previous users for feedback. Review information on e-commerce sites is generally well structured and is widely used by academia and industry. Before 2017, this topic was the most productive of the five hot topics.
T.1614 is mainly about word and sentence embedding representations, which can be applied to various downstream tasks such as sentiment classification and textual entailment. Embeddings have become an important part of NLP systems based on deep learning. In 2013, Google released word2vec, a tool for computing word vectors that provides an efficient method for learning high-quality word vector representations from large amounts of unstructured text data, and it attracted great attention from industry and academia. [21,22] This topic entered a period of rapid development after 2013. From GloVe [23] and ELMo [24] to BERT, [25] which were the most cited in 2014, 2018 and 2019 respectively in Scopus, language representation has reached successive milestones and has become one of the fastest growing topics at present.
T.4431 focuses on the application of NLP to medical data, especially electronic medical records. Although it is one of the hot topics in the NLP field, it is not developing as fast as T.108 and T.1614 (Figure 8). The main reason is that this field faces problems such as non-open data, data islands, data privacy and ethical issues. Data quality, structuring and standardization of medical records are the primary issues that need to be resolved. In September 2016, the Laboratory for Computational Physiology of MIT released the third edition of the MIMIC (Medical Information Mart for Intensive Care) data set, MIMIC-III, comprising information relating to patients admitted to critical care units at a large tertiary care hospital. [26] Since then, the number of publications on this topic has increased.
In contrast, T.30920 and T.22847 are not the most productive topics, but they still have a high prominence. T.22847 is about machine translation, and its most cited paper is Google's 2014 implementation of an English to French translation task with a multilayered Long Short-Term Memory (LSTM). [27] This article proposed the use of an RNN Encoder-Decoder in neural machine translation (NMT), that is, the well-known Seq2Seq model, and laid the foundation for NMT. Since 2015, the number of publications on this topic has gradually increased. In 2016 in particular, Google released Google Neural Machine Translation (GNMT), which marked NMT as the mainstream of modern machine translation and one of the most popular topics in the current NLP field. T.30920 mainly focuses on image captioning, which automatically describes the content of images. It is a fundamental problem in artificial intelligence that connects computer vision and NLP. The most frequently cited publication in this topic describes Microsoft's open data set COCO (Common Objects in Context), which provides a data basis for subsequent image captioning research. Starting from the work of Show and Tell [28] published in 2015, the field of image description has developed rapidly in recent years. Models have been improved gradually by adding attention mechanisms, visual sentinels, improved CNNs, reinforcement learning and object detection. It is also one of the hot frontier topics in the NLP field.

Q5: What is the development profile of a topic? - A case study
The topic "Captions; Question Answering; Image Annotation (T.30920)" is a relatively new field and has developed rapidly since 2015 (Figure 8). Figure 9 shows the top 50 key phrases, including question answering, caption, video, semantic, NLP system and computer vision. This topic contains two main tasks: image captioning and visual question answering. Both combine computer vision and NLP technology to process images and text in order to answer questions about an image, with applications such as image retrieval and life assistance for the visually impaired.
In this topic, the most productive countries are China (741 publications) and the United States (555 publications), which together account for 73% of all publications during 2017-2019. The most active institutions are shown in Table 5. Chinese Academy of Sciences has the most scholarly output, while Microsoft USA, Facebook Inc, Georgia Institute of Technology and Alphabet Inc. have a higher impact. Georgia Institute of Technology has the highest percentage of highly cited papers, with 68.3% of its output in the top 10% citation percentiles. The most active authors are shown in Table 6. These authors not only have a high scholarly output, but also a high proportion of highly cited papers. The top 5 most active sources by scholarly output are shown in Table 7. The top conferences such as CVPR, ICCV, IJCAI and NIPS are the …
From the case studies, we can see that research fronts are useful for the analysis and planning of researchers and policy makers. Research frontier topics can provide researchers with a clear picture of their overall research performance and insight into the momentum of particular topics. Research managers can evaluate the relationship between their research direction and research fronts. They can also make comparative assessments of competing institutions. Previous studies have found that the topic prominence value has a strong correlation with funding, which is useful for stakeholders and their portfolio planning needs. [16] Based on the analysis of the most productive countries, institutions and authors, researchers and policy makers can look for collaborations with these authors or institutions.
Nevertheless, this paper only uses data from Scopus. To identify research fronts more comprehensively and objectively, multi-source data should be applied, such as patent data, science and technology planning texts and fund project data. Moreover, prominence does not equal importance, so we might have overlooked some topics with low prominence that are nevertheless important. Future research will add more data sources and incorporate more indicators in order to identify research frontiers more rigorously.

Figure 1: Analyzing schema of this study.

Figure 3 shows the top 10 countries by scholarly output during 2017-2019. China, the United States and India are the most

Figure 2: Top 100 frontier topics by scholarly output in CS.

Figure 5: Leading countries by citation count.

Figure 4: Leading countries by scholarly output.

Figure 7: Leading institutions by citation count.

Table 4: Hot frontier topics in NLP. Columns: Topic; Topic Number; Citation Count; Scopus Views Count; Average CiteScore; Prominence percentile.
Journal of Scientometric Research, Vol 10, Issue 1, Jan-Apr 2021

Table 7: Top 5 productive Scopus sources in Topic T.30920.

CONCLUSION
This study presented the research fronts of CS, especially hot frontier topics, from both theoretical and empirical perspectives. It was based on direct citation analysis with a global mapping, which creates accurate, fine-grained topics. Leading countries and institutions were selected in terms of scholarly output and citation count. China is the most productive country and the USA is the most influential country in CS research. Chinese Academy of Sciences is the leading institution in both scholarly output and citation frequency. Other government research institutions such as CNRS, RIKEN and CSIRO are leading in related hot topics. The corporate sector is also active in hot frontier topics, with companies such as Alphabet Inc., Ericsson AB, Facebook Inc, AstraZeneca, Lucent and Toshiba. Academic, government and corporate sectors jointly lead the direction of computer science and technology. Multi-party participation can quickly transform research output into innovative products that benefit mankind.