Identification of data mining research frontier based on conference papers

Purpose – Identifying the frontiers of a specific research field is one of the most basic tasks in bibliometrics. Research published in leading conferences is crucial to the data mining research community, yet few studies have focused on it. The purpose of this study is to detect the intellectual structure of data mining based on conference papers. Design/methodology/approach – This study takes as its sample the papers of the nine top-ranked conferences in the data mining field, as listed by Google Scholar Metrics. Based on paper counts, it first examines the annual number of published papers and their distribution across conferences. Then, from the perspective of keywords, CiteSpace is used to mine the conference papers and identify the frontiers of data mining, focusing on keyword term frequency, keyword betweenness centrality, keyword clustering and burst


Introduction
In the era of "Internet+," big data has become the focus of attention. How to mine and use this massive amount of information is what makes the study of big data significant. Generally speaking, data mining refers to an engineered and systematic process of mining implicit and previously unknown but potentially useful information and patterns from large amounts of data. Through data mining, engineers can propose new algorithms and models and medical institutions can develop new antibodies and drugs. It can be said that data mining provides new ideas and methods for scientific research.
In recent years, the research output of the data mining discipline has been increasing, and the volume published in well-known international journals and conferences continues to grow. To investigate the research trends of data mining more comprehensively and in finer detail, this paper examines the data mining discipline based on papers from authoritative international conferences.
The concept of the frontier of a discipline has been revised and enriched by other scholars since it was introduced by Price in 1965. Price (1965) believed that the research frontier was time-varying, that is, it changed with time. For a discipline, the process of change in the research frontier basically represents the development of that discipline. There are many concepts related to the research frontier, such as hot topics, emerging research areas, emerging topics, emerging trends and potential knowledge. The methods for identifying the research frontier fall roughly into two categories: qualitative and quantitative. The qualitative research method is relatively mature and the quantitative research method is still developing and improving. In this paper, the discipline frontier is treated as the same concept as the research frontier.
Currently, there is no clear and uniform definition of the research frontier. The definitions are broadly divided into three categories: (1) defining the cited literature as the research frontier; (2) defining the citing literature as the research frontier; and (3) defining the burst words or hot topics as the discipline frontier.
There are many concepts in the field of information science similar to the research frontier, such as emerging trends, emerging topics and research hotspots. The concept of emerging trends was proposed by Kontostathis et al. (2004) and refers to subject areas that have gradually attracted interest over time and are being discussed by more and more researchers. The concept of emerging topics was proposed by Matsumura et al. (2002) and refers to a set of emerging subject areas represented by multiple keywords or phrases in a particular scientific research field. It represents the most promising research directions or trends in the discipline. Although research hotspots have no clear definition, the term is widely used and is generally taken to refer to hot papers.
There are numerous methods for research frontier detection, such as co-citation analysis by Small (1973) and White and Griffith (1981), coupling analysis by Kessler (1963) and Weinberg (1974) and co-word analysis by Morris and Yen (2004) and Bhattacharya and Basu (1998). These methods have been used in various disciplines to help discover research topics from research papers. Many bibliographic elements, such as co-citation, coupling, keywords and authors, are used to detect intellectual structures. Among these elements, keywords carefully chosen by authors can best show the main ideas of the manuscripts. Almost all previous work detects the research frontier from journal papers. However, conference papers are crucial to the data mining research community.
In this paper, papers from authoritative international conferences in the field of data mining are taken as the research object. Keyword term frequency analysis, centrality analysis, keyword clustering and burst word analysis are then adopted to determine the frontiers and trends of the data mining discipline.

Data acquisition
The authors select Google Scholar Metrics as the data acquisition standard, because data mining is a relatively new research field and it is not a subcategory of an existing discipline. Google Scholar Metrics (2017) provides an academic evaluation standard to help researchers assess the visibility and influence of recent articles in scholarly publications. One of its important metrics is the H5-index. For a publication, the H5-index is the H-index for articles published in the past five complete years, i.e. it is the largest number h such that h articles published in the past five years have at least h citations each. The authors take as the data source the top international conferences that Google Scholar Metrics lists in the subcategory "Data Mining and Analysis" of the category "Engineering and Computer Science" (Table 1).
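The H-index definition above can be illustrated with a minimal sketch. The function name `h_index` and the sample citation counts are hypothetical and are not taken from Google Scholar Metrics; restricting the input to articles published in the past five complete years would give the H5-index.

```python
def h_index(citations):
    """Return the largest h such that h articles have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank  # the top `rank` articles all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for one venue's articles (illustrative only).
example_citations = [25, 8, 5, 3, 3, 1, 0]
print(h_index(example_citations))  # prints 3
```

Here three articles have at least three citations each, but there are not four articles with at least four citations, so h is 3.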
There are nine conferences in total. To guarantee data integrity and reliability, the Web of Science (WOS) and Scopus databases were used to complement each other. The search strategy was as follows: select the WOS Core Collection as the database; enter the corresponding conference name in the conference field; and set the time span from 2007 to 2016. In total, the bibliographic information of 11,870 conference papers was downloaded from WOS and Scopus.
From 2007 to 2016, the number of papers published in data mining conferences showed an increasing trend (Figure 1). This reflects the growing emphasis on the data mining discipline in engineering and computing over the years, with research in the area also deepening. The line chart shows that, although the number of published papers fluctuates in individual years, the pattern of growth is exactly the opposite of that of the journal papers: growth is larger before 2010 and after 2013 and relatively flat between 2010 and 2013. To a certain extent, this illustrates that conference-based research on data mining is moving toward a stable trend.

Methodology
In this paper, the authors use keyword term frequency analysis, keyword clustering analysis, keyword burst analysis and keyword betweenness centrality analysis to identify the research frontier of data mining based on conference papers.
2.3.1 Tools. The research uses CiteSpace 5.0.R4 as the main analysis tool. CiteSpace is a visualization tool for academic literature analysis. Its main function is to detect hot topics and trends in a given subject or field based on specific algorithms. Based on the principles and algorithms of co-citation/co-occurrence analysis, CiteSpace provides cooperative network analysis of authors/institutions/countries, co-occurrence analysis of terms/keywords/categories, co-citation analysis of references/authors/journals and burst analysis for frontier detection. It also provides output and visualization in three views: clustering, timeline and time zone. Researchers can adjust various parameter settings according to their needs.
2.3.2 Methods. Keyword term frequency analysis, betweenness centrality analysis and burst analysis take a micro perspective: they aim to identify hot topics by measuring keyword term frequency, betweenness centrality and burst metrics for research frontier identification. The clustering analysis is more focused on macro analysis.
(Figure caption: Annual published papers of data mining authoritative conferences.)
2.3.2.1 Keyword term frequency analysis. The basic principle of term frequency analysis is to determine research hotspots and their trends from how often words occur. Term frequency analysis can reflect the hotspots in a research field through keywords that exceed given thresholds. A higher term frequency indicates that researchers pay more attention to the topic. Studying the subject matter of the literature can not only reveal its hotspots but also, in combination with term frequency, reveal the time distribution of the research topics and thereby identify research hotspots and trends.
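A minimal sketch of thresholded term frequency counting is given below. It is not CiteSpace's implementation; the function name `keyword_frequencies`, the `min_count` threshold and the sample records are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def keyword_frequencies(papers, min_count=2):
    """Count in how many papers each keyword occurs, overall and per year.

    papers: iterable of (year, [keywords]) records (hypothetical input format).
    Returns (hot, by_year): keywords meeting the threshold, and yearly counts.
    """
    total = Counter()
    by_year = defaultdict(Counter)
    for year, keywords in papers:
        # Normalize case and count each keyword at most once per paper.
        unique = {k.strip().lower() for k in keywords}
        total.update(unique)
        by_year[year].update(unique)
    hot = {k: c for k, c in total.items() if c >= min_count}
    return hot, by_year


# Illustrative records only; not data from the study.
sample = [
    (2007, ["Clustering", "Classification"]),
    (2008, ["clustering", "Recommender System"]),
    (2015, ["Big Data", "Recommender System"]),
    (2016, ["big data", "Feature Selection"]),
]
hot, by_year = keyword_frequencies(sample)
print(sorted(hot.items()))
```

The per-year counters give the time distribution of each keyword, which is what allows a hotspot to be located in time as well as ranked by frequency.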
2.3.2.2 Keyword betweenness centrality analysis. In CiteSpace, centrality refers to betweenness centrality. It is an indicator that measures the importance of nodes in a network, proposed by Freeman (1978). In the visual knowledge network map, the nodes with higher betweenness centrality are highlighted with purple circles. The higher the centrality, the stronger the bridging (transitional) role of the key nodes. By analyzing these high-centrality nodes in chronological order, the authors can vertically compare the development history of the data mining discipline based on journal papers and conference papers.
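To make the measure concrete, the sketch below computes normalized betweenness centrality on a small unweighted, undirected keyword graph using Brandes' algorithm. This is an illustration under stated assumptions, not CiteSpace's code: the function name and the toy graph are hypothetical, and edge weights are ignored.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph.

    graph: {node: set of neighbouring nodes}. Returns scores scaled to [0, 1].
    """
    score = {v: 0.0 for v in graph}
    for source in graph:
        order = []                         # nodes in non-decreasing distance
        preds = {v: [] for v in graph}     # predecessors on shortest paths
        sigma = {v: 0 for v in graph}      # number of shortest paths
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                       # breadth-first search from source
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while order:                       # accumulate dependencies backwards
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    n = len(graph)
    # Each unordered pair is visited from both ends, so divide by (n-1)(n-2).
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: s * scale for v, s in score.items()}


# Toy co-occurrence graph (illustrative only).
toy = {
    "recommender system": {"collaborative filtering", "matrix factorization", "social network"},
    "collaborative filtering": {"recommender system", "matrix factorization"},
    "matrix factorization": {"recommender system", "collaborative filtering"},
    "social network": {"recommender system", "community detection"},
    "community detection": {"social network"},
}
for keyword, value in sorted(betweenness_centrality(toy).items(), key=lambda kv: -kv[1]):
    print(f"{keyword}: {value:.3f}")
```

In this toy graph "recommender system" lies on the shortest paths between the two keyword groups and so receives the highest score, which is the bridging role described above.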
2.3.2.3 Keywords clustering analysis. Clustering analysis simplifies the intricate co-word network into a relatively small number of groups whose relationships can be represented directly. Based on the similarity or relevance of knowledge, keywords with co-occurrence relations are reorganized into knowledge communities by clustering. The larger the cluster, the more keywords it contains. Through clustering analysis, the authors can directly understand the research hotspots in this field.
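The idea of reducing a co-word network to a few groups can be sketched with a deliberately simple stand-in: drop weak co-occurrence edges and take the connected components that remain. This is not the clustering procedure CiteSpace uses; the function name `keyword_clusters`, the `min_weight` threshold and the sample edges are assumptions for illustration.

```python
from collections import defaultdict


def keyword_clusters(edges, min_weight=2):
    """Group keywords into connected components of the thresholded network.

    edges: {(keyword_a, keyword_b): co-occurrence count}.
    Returns a list of keyword sets, largest cluster first.
    """
    neighbors = defaultdict(set)
    for (a, b), weight in edges.items():
        if weight >= min_weight:           # keep only sufficiently strong links
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen = set()
    clusters = []
    for start in neighbors:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:                       # depth-first traversal
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(neighbors[node] - seen)
        clusters.append(component)
    return sorted(clusters, key=len, reverse=True)


# Illustrative co-occurrence counts only.
sample_edges = {
    ("big data", "hadoop"): 5,
    ("hadoop", "mapreduce"): 4,
    ("matrix factorization", "recommender system"): 3,
    ("big data", "recommender system"): 1,
}
for cluster in keyword_clusters(sample_edges):
    print(sorted(cluster))
```

The single weak link between "big data" and "recommender system" is discarded, so the two groups separate into an infrastructure cluster and a recommendation cluster.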
2.3.2.4 Keywords burst analysis. The burst detection algorithm proposed by Kleinberg (2003) can be used to detect a sudden increase in research interest in a given field. A keyword burst refers to a large change in how often a keyword is used during a certain period, such as a sudden rise or a sudden drop. The time at which a burst begins can be used to trace the development and evolution of a research topic in related fields. In CiteSpace, if a cluster contains more burst nodes, the corresponding field is more active or represents an emerging trend of research.
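The following sketch shows a simplified two-state, batched variant in the spirit of Kleinberg (2003): each time slice is assigned a baseline or burst state by minimizing a fit cost plus a penalty for entering the burst state. It is not CiteSpace's implementation; the parameters `s` and `gamma`, the function name and the yearly counts are assumptions for illustration.

```python
import math


def detect_bursts(relevant, totals, s=2.0, gamma=1.0):
    """Return one state per time slice: 0 = baseline, 1 = burst.

    relevant[t]: papers in slice t that use the keyword.
    totals[t]:   all papers in slice t.
    """
    n = len(relevant)
    p0 = sum(relevant) / sum(totals)       # baseline usage rate
    if p0 <= 0 or p0 >= 1:
        return [0] * n
    p1 = min(s * p0, 0.9999)               # elevated rate in the burst state
    up_cost = gamma * math.log(n)          # penalty for moving from 0 to 1

    def fit_cost(p, d, r):
        # Negative binomial log-likelihood; the binomial coefficient is
        # omitted because it is identical for both states.
        return -(d * math.log(p) + (r - d) * math.log(1 - p))

    cost = [0.0, math.inf]                 # the sequence starts in state 0
    back = []
    for d, r in zip(relevant, totals):     # Viterbi forward pass
        new_cost, pointers = [], []
        for state, p in enumerate((p0, p1)):
            from_0 = cost[0] + (up_cost if state == 1 else 0.0)
            from_1 = cost[1]               # staying in or leaving 1 is free
            prev = 0 if from_0 <= from_1 else 1
            new_cost.append(min(from_0, from_1) + fit_cost(p, d, r))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)

    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for pointers in reversed(back):        # trace the cheapest path backwards
        states.append(state)
        state = pointers[state]
    states.reverse()
    return states


# Hypothetical yearly counts for one keyword, 2007 to 2016 (illustrative only).
years = list(range(2007, 2017))
uses = [1, 1, 1, 1, 1, 1, 1, 10, 12, 11]
papers_per_year = [100] * 10
states = detect_bursts(uses, papers_per_year)
print([year for year, state in zip(years, states) if state == 1])  # prints [2014, 2015, 2016]
```

A larger `gamma` makes it more expensive to enter the burst state, so only sharper or longer rises are reported.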

Keywords knowledge map analysis of conference papers
The authors use CiteSpace to apply keyword co-occurrence analysis to the data mining papers published in the above-mentioned nine conferences over the past 10 years. The time slice is one year, the node type is keyword and the item selection criterion is "Top N = 50." The network is pruned with the minimum spanning tree option, applied to each slice network and to the merged network.
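A rough sketch of how a keyword co-occurrence network with yearly slices and a per-slice Top N selection could be assembled is shown below. It approximates the settings described here and is not CiteSpace's implementation; the function name, the input format and the sample records are assumptions, and the minimum spanning tree pruning step is omitted.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(papers, top_n=50):
    """Build weighted co-occurrence edges from (year, [keywords]) records.

    Within each one-year slice only the top_n most frequent keywords are kept.
    Returns a Counter mapping (keyword_a, keyword_b), with a < b, to the
    number of papers in which both keywords appear.
    """
    by_year = {}
    for year, keywords in papers:
        cleaned = sorted({k.strip().lower() for k in keywords})
        by_year.setdefault(year, []).append(cleaned)

    edges = Counter()
    for keyword_lists in by_year.values():
        freq = Counter(k for kws in keyword_lists for k in kws)
        selected = {k for k, _ in freq.most_common(top_n)}
        for kws in keyword_lists:
            kept = [k for k in kws if k in selected]   # still sorted
            edges.update(combinations(kept, 2))
    return edges


# Illustrative records only; not data from the study.
sample = [
    (2015, ["Big Data", "Hadoop", "MapReduce"]),
    (2015, ["big data", "hadoop"]),
    (2016, ["Recommender System", "Feature Selection"]),
    (2016, ["recommender system", "feature selection", "E-commerce"]),
]
for pair, weight in sorted(cooccurrence_network(sample, top_n=2).items()):
    print(pair, weight)
```

With `top_n=2`, the less frequent keywords in each slice ("mapreduce", "e-commerce") are dropped before edges are formed, which mirrors how a Top N criterion limits the network to the most prominent items per year.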
The content indicated by these high-frequency keywords shows that the research hotspots in the field of data mining based on conference papers are mainly focused on artificial intelligence-based recommendation systems, information retrieval and data management. In addition, the results show that the high-frequency keywords mostly appear in the first several years (2007-2010), which indicates that data mining research initially focused on fewer themes than it did later. This is consistent with the fact that research on data mining began its fast development around this time.

Keywords betweenness centrality analysis
In the visual knowledge network map, the keyword nodes with purple circles are the nodes with higher centrality, which are listed in Table 3. The keyword nodes with high betweenness centrality include "recommender system," "world wide web," "problem-solving," "user interface," "social networking (online)," and "social network." By analyzing these high-centrality nodes in order of year, the authors can obtain the research development history of the data mining discipline based on conference papers between 2007 and 2016: The first stage (2007-2008): implementation of data mining classification and clustering algorithm. The second stage (2009)

Keywords clustering analysis
By clustering the high-frequency words, the research hotspots in the data mining field based on conference papers in recent years are obtained. Using the log-likelihood ratio (LLR) algorithm, a cluster view is generated (Figure 2), containing 23 clusters (Table 4). The first four clusters are used as examples for analysis. The results are as follows.
The average year is 2010, including 16 keywords, which are "factorization,"

Conference papers
There were 22 burst words from 2014 to 2016, including "big data," "network," "mapreduce," "system," "hadoop," "factorization," "state of the art method," "website," "real world," "active learning," "space division multiple access," "topic modeling," "matrix factorization," "iterative method," "complex network," "decision-making," "stochastic system," "feature selection," "matrix algebra," "sentiment analysis," "graphic method" and "markov process." These terms reflect that research on Hadoop-based big data processing frameworks and the in-depth study of recommendation systems are among the frontier research topics of the discipline. In 2015, the keywords community discovery, random walk and big data economic effect appeared in the co-occurrence analysis results, reflecting that the implementation of complex data analysis algorithms and the improvement of social media evaluation systems are among the frontier research topics. In 2016, the keywords recommendation system, cloud computing, feature selection and e-commerce appeared in the co-occurrence analysis results, reflecting that commercialization analysis based on big data and artificial intelligence is one of the frontier research topics.

Conclusion
Unlike previous work, which identified the research frontier from journal papers, this paper used conference papers as the object of analysis. Based on the keyword co-occurrence network, and along four dimensions (keyword term frequency, betweenness centrality, clustering analysis and burst analysis), this paper identified and analyzed the research frontiers of the data mining discipline from 2007 to 2016. The purpose was to grasp the research frontiers of data mining based on conference papers more accurately and comprehensively. The clustering analysis provided the macro view: its results showed that the conference papers focused on the infrastructure and application of big data. The keyword term frequency, betweenness centrality and burst analyses provided the micro view: their results showed that the contents of the conference papers were very diverse, application-oriented and time-sensitive. The next step is to conduct a comparative analysis of the frontiers of data mining identified from journal papers and from conference papers.