Empirical Study on Citation Count Prediction of Research Articles

Citation is a measure that quantifies the impact of a researcher, a research article and a journal. Investigating the citations of articles and researchers is therefore an important task in the research community, and understanding and predicting the citation patterns of research articles has become popular across scientific fields. In this work, we give a machine learning approach to predict the citations of research articles using their keywords. We study citation impact based on the keywords mentioned in the articles, using a data set of publications from the various Physical Review journals from 1985 to 2012. In this data set, each publication is allocated PACS codes (keywords) by its authors, and each code represents a sub-field of physics. We investigate the impact of an article's PACS codes on its citations, and perform our analysis on the first (sub-field of physics), second (sub area of a sub-field of physics) and third level of PACS codes. We observed that, compared to the first level, every pair of citation patterns of the second level is highly correlated. We also obtained a universal approximation curve for the third level that matches the average value of the first level. This curve looks like a shifted and scaled version of the Gaussian function and is right skewed. We can also predict citations based on keywords by using this universal curve.


INTRODUCTION
In recent years, bibliometric and scientometric indicators have been applied in the context of research evaluation as well as research impact more generally. Citation is a measure that quantifies the impact of the researcher and the journal's quality. It plays a pivotal role in a journal's impact factor, a researcher's h-index and various other measures. Recently this measure has been gaining importance in the research community. Researchers and scholars contribute many publications, and they seek to make an impact in their scientific research communities during their academic careers. Articles with the highest possible research impact reflect on their talents and create new collaboration opportunities. Recently, the publication of highly cited papers has been increasing due to the competitiveness of research grants and collaboration based on the researcher's impact. As a result, publishers and authors of research articles have investigated the variables that affect paper citations.

Related work
Research articles receive a good number of citations if they cover important research topics and/or are relevant, popular and useful. [1] However, the diversity of topics and the influence of the references can also affect the citations of an article. [2][3][4][5] The effect of bibliometric variables on citations has been investigated because they are readily available and do not change over time. However, bibliometrics do not capture the entire impact of citations and do not claim to identify causal relationships. [6] Predicting the citations of research articles has therefore become popular in scientific research fields. Researchers need to investigate the various factors of an article that affect its citation count. Recently, the research community has accepted citation counts as one of the main measures of the impact of research. Nowadays, many countries consider citation counts in their national research evaluation practices, such as the United Kingdom, [7,8] Australia [9] and New Zealand. [10] Researchers have so far discovered various factors (bibliometric variables) that affect citation counts. Yu et al. [11] found that the authors, the journal, the research area, and the papers themselves affect citations. The authors of [12] observed the effect on citation patterns of the number of publications and the number of citations of each author of an article. Based on the APS Physical Review database, the authors of [14] identified the relationship between authorship and citation, and found that individual researchers cite their co-authors' work more frequently than the work of others.
The authors of [15] found that papers of medium diversity receive more citations than papers of very low or very high diversity. A measure called the paper potential index has been introduced, defined on the basis of the inherent quality of a scholarly paper, the decay of its impact over time, its early citations, and the impact of its early citers. [2,16,17] It was observed that the paper potential index better interprets changes in citation, without the need to adjust parameters. Many others have similarly looked at different factors of articles or authors that affect their citations. [18][19][20][21] The authors of [22] investigated whether multi/inter-disciplinary research activities are correlated with research impact and the number of publications. Researchers have explored many other factors that play a key role in citation impact, such as the year of publication, number of pages, number of authors, number of references, abstract length, and keyword repetitions in the abstract. [6,18,21,23,24] Since citation counts have many uses within academia and other fields, it is very important to study why one article is cited more than another. In this work, we investigate whether the keywords mentioned in a publication affect its citations. This study helps researchers, readers and editors gain insight into the intelligent use of keywords in research publications to increase the number of citations. We also give various statistical results on citations based on the individual keywords mentioned in the article.
Our Contribution: In this work, we investigate the impact of the PACS codes of an article on the article's citations. We perform the analysis on the first (sub-field of physics) and second (sub area of a sub-field of physics) level of PACS codes. For every sub-field, the maximum number of citations is reached within two or three years of publication, and citations reduce over time after that. In the second-level PACS codes, we similarly observed that some sub areas of physics receive more citations than others. We also observed that, compared to the first level, every pair of citation patterns of the second level is highly correlated. We find a universal approximation curve for the third level that matches the average value of the first level. This universal curve is a shifted and scaled version of the Gaussian function, and it is right skewed. We can also predict citations based on the keywords of a paper by using this universal curve.

LITERATURE REVIEW
Our aim is to predict the citations of a paper from the keywords used in the paper. The keywords usually come from the sub-fields of physics given in Table 1. Based on the keywords mentioned in a paper, we can classify the paper and predict how many citations it will receive. Table 1 gives the level 1 classification of physics only. Here we use the decision tree machine learning technique. Let S be a set of samples in which the proportion of class i is p_i. The entropy, in its standard form, is

$$E(S) = -\sum_{i} p_i \log_2 p_i$$

where E is the entropy. Partitioning the samples of S with a feature F from the feature set, the information gain is

$$\mathrm{Gain}(S, F) = E(S) - \sum_{v} \frac{|S_v|}{|S|} E(S_v)$$

where S_v is the subset of S for which F takes the value v. For each level, we analyze the citations, as explained in the following sections.
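As an illustration of the entropy and information gain used by the decision tree, the following minimal sketch computes both quantities with the standard definitions. The function names, the citation classes ("high"/"low") and the toy PACS digits are hypothetical and are not taken from the data set.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(labels, feature_values):
    """Entropy reduction obtained by partitioning `labels` on `feature_values`."""
    total = len(labels)
    partitions = {}
    for label, value in zip(labels, feature_values):
        partitions.setdefault(value, []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: a citation class per paper and its first-level PACS digit.
citation_class = ["high", "high", "low", "low"]
pacs_first_level = ["7", "7", "5", "5"]
gain = information_gain(citation_class, pacs_first_level)
print(gain)
```

In this toy example the PACS digit separates the two classes perfectly, so the gain equals the full entropy of the labels. A decision tree would choose, at each node, the feature with the largest gain.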

DATA SET
The Physical Review, published by the American Physical Society (APS), began publishing research articles in 1893. In subsequent years, APS added journals according to the sub-fields of physics, which are shown in Table 3.
In this paper, we consider research articles published from 1985 to 2012 in the APS Physical Review journals mentioned in Table 1 to investigate the impact of PACS codes on citations. For each article, the data set includes the title, names of the authors, publication date, authors' affiliations, unique digital object identifier (article ID) and PACS codes (keywords used in the article, which represent the area the article belongs to). We consider citations of Physical Review articles from APS journals along with the metadata of the various journal articles (citations received from journals other than APS journals are not included). These data sets are requested and received from. The basic details of the data are given in Table 2.

PACS Classification
The PACS scheme is a hierarchical classification that indicates the various fields and sub-fields of physics up to five levels. A PACS code consists of two pairs of digits followed by a pair of alphabetic characters, separated by dots. For a simple illustration, in PACS code 02.10.Ab, the first digit 0 is for General Physics, the second digit 2 for Mathematical methods in physics, 10 for Logic, set theory, and algebra, and Ab represents Logic and set theory. These codes are updated frequently over time by the American Institute of Physics (some codes are newly introduced, and some are removed). In this paper, we restrict our analysis to the third level of the hierarchy of PACS codes (the first four digits), since these represent all sub-fields of physics and are reasonably stable. In future work, we will extend our analysis to deeper levels of the PACS hierarchy. PACS codes were introduced in 1975 and have been used in articles since then. However, most articles published from 1975 to 1984 were not allocated any PACS codes, because the scheme had only recently been introduced and people were not yet aware of it (Figure 1). We consider the period from 1985 onwards, as the adoption of PACS codes jumped to more than 90% and has been consistently high since then.
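The three levels used in this analysis can be read directly off a PACS code. The sketch below is a minimal illustration, assuming codes of the form shown above (two digits, a dot, two digits, a dot, and two further characters); the function name `pacs_levels` and the returned keys are hypothetical.

```python
import re

# Two digits, a dot, two digits, a dot, then two non-space characters (e.g. "Ab").
# The exact character set of the last pair is an assumption of this sketch.
PACS_PATTERN = re.compile(r"^(\d{2})\.(\d{2})\.(\S{2})$")


def pacs_levels(code):
    """Split a PACS code into the first, second and third hierarchy levels."""
    match = PACS_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not a recognised PACS code: {code!r}")
    field, group, _detail = match.groups()
    return {
        "first": field[0],            # sub-field of physics, 0 to 9
        "second": field,              # sub area, 00 to 99
        "third": f"{field}.{group}",  # first four digits
    }


levels = pacs_levels("02.10.Ab")
print(levels)
```

For the example code 02.10.Ab, the first level is 0 (General Physics), the second level is 02 and the third level is 02.10.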
In Figure 2, we show the fraction of researchers using various PACS codes in the articles published between 1985 and 2012. Most researchers have used only one to four PACS codes in their articles. We observe a power-law decay up to 60 PACS codes, after which the shape of the graph changes. The pattern of the plot follows a multiple Pareto distribution. [25] Figure 3 shows the fraction of articles using different numbers of PACS codes. Most articles use fewer than six PACS codes, so the number of PACS codes used in an article is very small. A maximum of 22 PACS codes is used, and only in very few articles.
Recently, many researchers have studied the temporal variation of citations from a data-analytic perspective. [26][27][28] Wang et al. [29] provided a log-normal distribution for the age at which papers receive their maximum citations (Wang, Song et al. 2013). [28] Enduri et al. analyze the pattern of an article's citations over time and its correspondence with PACS codes. [15] The average number of citations of the papers published during 1985-2012 is shown in Figure 4. The maximum number of citations is reached within the first two to four years, after which citations decrease drastically.
After fifteen to twenty years the citations of a paper become negligible. This analysis was done in [30] on articles published from 1998 to 2006 in Spanish psychology journals in the Web of Science. In this work, we investigate the impact of PACS codes (keywords or sub-fields of physics) on the citations of a publication. We analyze and predict the citations received for the PACS codes included in a research article.

RESULTS
In this section, we present different analyses of the citations received by the different levels of sub-fields of physics over 28 years from the year of publication of the research article.
We investigate the impact of the PACS codes of an article on the article's citations. Our analysis is based on the first (sub-field of physics) and second (sub area of a sub-field of physics) level of PACS codes. We first consider the first-level PACS code information of each paper and find its citations. By analyzing the citation information for each level of PACS code in this way, we can analyze the impact of PACS codes on citations.
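The grouping of citations by PACS level and by years since publication can be sketched as follows. This is only an illustration under stated assumptions: the data layout (`papers` as a mapping from article ID to publication year and PACS codes, `citations` as pairs of citing and cited article IDs) and all names are hypothetical, and a cited paper is counted once per distinct key even if several of its codes share that key.

```python
from collections import defaultdict


def first_level(code):
    """First-level key of a PACS code such as '71.10.-w' (its first digit)."""
    return code[0]


def citations_by_age(papers, citations, level_of):
    """Return {pacs_key: {age_in_years: citation_count}}."""
    profile = defaultdict(lambda: defaultdict(int))
    for citing_id, cited_id in citations:
        if citing_id not in papers or cited_id not in papers:
            continue  # citation from or to an article outside the data set
        citing_year, _ = papers[citing_id]
        cited_year, cited_codes = papers[cited_id]
        age = citing_year - cited_year
        if age < 0:
            continue  # ignore inconsistent records
        # Count the cited paper once per distinct key (assumption of this sketch).
        for key in {level_of(code) for code in cited_codes}:
            profile[key][age] += 1
    return {key: dict(ages) for key, ages in profile.items()}


# Hypothetical toy records: article ID -> (publication year, PACS codes).
papers = {
    "A": (1990, ["71.10.-w", "74.20.Fg"]),
    "B": (1992, ["02.10.Ab"]),
    "C": (1993, ["71.15.Mb"]),
}
citations = [("B", "A"), ("C", "A"), ("C", "B")]
profile = citations_by_age(papers, citations, first_level)
print(profile)
```

Passing a different `level_of` function (for example one returning the first two digits) gives the corresponding second-level profile without changing the rest of the code.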

First Level
The first level of the PACS code and the sub-fields of physics it maps to are given in Table 1. The first level of the PACS code ranges from 0 (General physics) to 9 (Geophysics, Astronomy, and Astrophysics). The number of citations obtained per year for each PACS code over a period of 28 years is given in Figure 6. The maximum number of citations is reached within two or three years of publishing time for every sub-field, and these citations reduce over a period. Figure 8 shows the correlation among the citations of the physics sub-fields using a heat-map. A high positive correlation value indicates a high degree of linear variation in the positive direction. It is observed that the sub-fields PACS 20 (Nuclear Physics) and PACS 70 (Condensed matter: Electronic structure) are highly correlated (≥90%). It is also observed that the Condensed matter sub-field is highly correlated (≥85%) with the Atomic and Molecular, Electromagnetism, Physics of gases, Condensed matter, and Interdisciplinary Physics sub-fields (PACS codes 30, 40, 50, 60 and 80). The Geophysics sub-field (PACS 90) has a low correlation with the Condensed matter sub-field (PACS 60) (around 67%) and with General Physics (PACS 00) (around 60%). We can infer that most of the people who cite publications in sub-fields with PACS codes other than 70 also cite the sub-field with PACS code 70.
The mean, median, Q1 and Q3 values of citations for all the sub-fields are plotted in Figure 9. It is observed that the mean of citations is always greater than the median for all sub-fields, and hence the histograms of citations for all sub-fields are right skewed. Hence, over the years citations will reduce for all sub-fields. From the Physics of gases (50) sub-field to Interdisciplinary Physics (80), the values of the mean and Q3 are close to each other, indicating a more significant number of citations after the 21st year of publication.
The distribution plots for the sub-fields of physics are shown in Figure 5. It is observed that the sub-fields General (PACS 00), Electromagnetism, Optics (PACS 40), and Condensed matter (PACS 70) look almost symmetric (like a Gaussian curve) but with different means and spreads (variances). A similar inference can be drawn from the box plots shown in Figure 7. All other PACS codes have distributions skewed in either the positive or the negative direction. It is observed that the condensed matter sub-field has a larger mean than the other sub-fields. Interestingly, for Condensed matter: Structure, Thermal and Mechanical properties (PACS 60), the distribution seems to be bimodal, with a lower peak appearing around 53550 and a higher peak occurring around 53700. It is also observed that the spread around the higher peak is larger than that around the lower peak.
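The pairwise correlation between the yearly citation series of two sub-fields, as visualised in the heat-map, can be computed with the Pearson coefficient. The sketch below is a minimal illustration; the choice of the Pearson coefficient, the function names and the toy series are assumptions and not values from the data set.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(series):
    """Correlation for every ordered pair of keys in {key: yearly_counts}."""
    keys = sorted(series)
    return {(a, b): pearson(series[a], series[b]) for a in keys for b in keys}


# Hypothetical yearly citation counts for three first-level PACS codes.
yearly = {"20": [1, 4, 3, 2], "70": [2, 8, 6, 4], "90": [4, 1, 2, 3]}
matrix = correlation_matrix(yearly)
print(matrix[("20", "70")], matrix[("20", "90")])
```

In the toy data, the series for 70 is a scaled copy of the series for 20, so their correlation is 1, while the series for 90 moves in the opposite direction and gives a correlation of -1.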

Second Level
The second level of the PACS code is mapped to the sub areas of physics, which can be seen in. The second level means that we consider the PACS code up to its second digit, so it ranges from 00 to 99. There is a maximum of 100 sub areas, of which 68 are actually assigned in physics. Some of the 100 two-digit codes were not assigned or received a negligible number of citations in 28 years. So, for the 68 sub areas, we observed 28 years of citations of the publications in which these sub areas (PACS codes) are mentioned. Here we also observe the same behavior: citations reach a maximum within two or three years of publishing time for every sub-field of physics, which is shown in Figure 5. The purpose of going to the second level is to investigate the impact on citations of the sub areas mentioned in the research article.
The mean and median values of citations for all the sub-fields of the second level are plotted in Figure 11. For most second-level sub-fields the median number of citations is below the mean, and only rarely are the two equal. The PACS codes 71 (Electronic structure of bulk materials), 74 (Superconductivity) and 75 (Magnetic properties and materials) receive more citations than other PACS codes. Outside the condensed matter sub-field, 12 (Specific theories and interaction models) and 05 (Statistical physics, thermodynamics, and nonlinear dynamical systems) also attract many citations. The sub areas 28 (Nuclear engineering and nuclear power studies), 39 (Instrumentation and techniques for atomic and molecular physics), 45 (Classical mechanics of discrete systems), and 92 (Hydrospheric and atmospheric geophysics) received very few citations. This is due to the smaller number of articles or authors publishing in these areas and the limited attention these articles attract within them. We show box plots of citations for all the second-level sub-fields of physics in Figure 10. The sub-field of physics 75 (Magnetic properties and materials) receives more citations than other PACS codes. The sub-field of physics 04
Figure 14: Reduced χ2, A/w versus sub-fields.
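The summary statistics discussed for each sub-field (mean, median and quartiles) and the mean-above-median check for right skew can be sketched as follows. The function name, the returned keys and the toy citation counts are hypothetical; the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from the method used for the figures.

```python
import statistics


def summarize(citations):
    """Mean, median and quartiles of a list of citation counts."""
    q1, _q2, q3 = statistics.quantiles(citations, n=4)
    mean = statistics.fmean(citations)
    median = statistics.median(citations)
    return {
        "mean": mean,
        "median": median,
        "q1": q1,
        "q3": q3,
        # Heuristic used in the text: mean above median suggests right skew.
        "right_skewed": mean > median,
    }


# Hypothetical citation counts for one sub-field.
summary = summarize([1, 2, 2, 3, 12])
print(summary)
```

A single highly cited paper (12 in the toy data) pulls the mean above the median, which is the pattern reported for most sub-fields.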

Third Level
The third level of the PACS code is mapped to sub areas of the sub-fields of physics, which can be seen in [24]. The third level means that we consider the PACS code up to four digits, so it ranges from 00.00 to 99.99. We have a maximum of 1000 sub areas, of which 944 are actually assigned in physics at the third level. Some of these four-digit codes were not assigned or received a negligible number of citations in 28 years. So, for the 944 sub areas, we observed 28 years of citations of the publications in which these sub areas (PACS codes) are mentioned. For each PACS code, we look at the citations of the papers that carry that PACS code, and attribute the first 28 years of citations of each such paper to the PACS code. In this way we obtain a citation pattern for each PACS code. Here we also observe the same behavior: citations reach a maximum within two or three years of publishing time for every sub-field of physics. The purpose of going to the third level is to investigate the universal pattern of citations based on the sub areas mentioned in the research article. Figure 13 shows plots of citations with respect to years for all the sub-fields starting with codes 00−90. The universal curve is obtained using a shifted and scaled version of the Gaussian curve, which is right skewed. The standard deviation w specifies the width of the universal curve. Table 4 gives details about the universal curve for every sub-field of physics. The reduced χ2 specifies the goodness of fit and should be low. It is also observed from Figure 14 that the reduced χ2 and the ratio A/w are proportional to each other.
From Figure 14, it is observed that sub-fields 10 and 70 have the maximum values of reduced χ2 and A/w, which indicates that even though these sub-fields have many dominant sub areas, the universal curve approximates the average value of the first level. It is also observed from Figure 14 that a universal curve with a lower standard deviation (sub-field 70, w = 0.75) has a narrow width and reduces quickly from its maximum, whereas a curve with a higher standard deviation (sub-field 30, w = 5.92) has a wider width and reduces very slowly from its maximum value. The reduced χ2 and A/w are lower for the sub-fields 50 and 80, indicating that the universal curves for these sub-fields have a better goodness of fit than those of the other sub-fields. Given any paper in a sub-field, we can predict its first-level citations with the help of the corresponding universal curve for that sub-field derived in Figure 13.
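A shifted and scaled Gaussian and the reduced χ2 used to judge its fit can be sketched as follows. This is an illustration only: the parameterisation (offset `y0`, amplitude `amplitude`, centre `center`, width `width`) is an assumed common form and may differ from the exact equation and from the A and w of Table 4, the reduced χ2 here is unweighted, and the parameter values are hypothetical.

```python
import math


def universal_curve(t, y0, amplitude, center, width):
    """Assumed form: y0 + amplitude * exp(-(t - center)^2 / (2 * width^2))."""
    return y0 + amplitude * math.exp(-((t - center) ** 2) / (2.0 * width ** 2))


def reduced_chi_square(observed, predicted, n_params):
    """Unweighted residual sum of squares divided by the degrees of freedom."""
    dof = len(observed) - n_params
    if dof <= 0:
        raise ValueError("need more observations than fitted parameters")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / dof


# Hypothetical parameters: peak of 100 citations three years after publication.
params = {"y0": 0.0, "amplitude": 100.0, "center": 3.0, "width": 2.0}
ages = range(28)  # 28 years of citations after publication
predicted = [universal_curve(t, **params) for t in ages]

# Toy "observed" series: the curve plus a small constant offset.
observed = [p + 1.0 for p in predicted]
chi2 = reduced_chi_square(observed, predicted, n_params=4)
print(predicted[3], chi2)
```

With the peak only a few years after publication, the curve over the 28-year window has a short rise and a long tail to the right. Once the parameters for a sub-field are known, evaluating the curve at a given age gives the predicted citation level for that age.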

CONCLUSION AND DISCUSSION
In this work, we investigated the impact of the PACS codes of an article on the article's citations. We performed the analysis on the first (sub-field of physics) and second (sub area of a sub-field of physics) level of PACS codes. Papers from the condensed matter sub-fields receive the most citations, and papers on the Physics of gases receive the fewest. For every sub-field, the maximum number of citations is reached within two or three years of publishing time, and citations reduce over a period. Similarly, in the second-level PACS codes we observed that some sub areas of physics receive more citations than others.
We observed that Condensed Matter: Structure, Mechanical and Thermal Properties (PACS 60) has a bimodal distribution, while General (PACS 00), Electromagnetism, Optics (PACS 40), and Condensed matter (PACS 70) have distributions that look almost Gaussian, with different means and spreads (variances).
We also observed that, compared to the first level, every pair of citation patterns of the second level is highly correlated. For most second-level sub-fields the median number of citations is below the mean, and only rarely are the two equal. We are investigating the third-level PACS codes with their respective citations and how they are correlated. Our future goal is to determine, given an input PACS code or set of PACS codes, the citation pattern for that code or set of codes. We can also investigate the pattern of correlation between the third level and the first level. We also obtained a universal approximation curve for the third level that agrees with the average value of the first level. We can also predict citations based on the keywords of a paper by using this universal curve.

Limitations
In this data set, we studied PACS codes up to the third level. To apply the same techniques to other data sets, we need data in which each paper is associated with specific keywords, similar to the data sets used in this paper.
Obtaining the data is one of the challenging tasks in this kind of research work. We have predicted the citations for each sub-field of physics with a maximum accuracy of 88%. A further challenge is to improve this accuracy by exploring the keyword analysis in more depth and with more data. For this kind of research work, we recommend data sets with a stronger association between papers and keywords, so that more accurate results can be obtained.