Recruitment Boosted Epidemiological Model for Qualitative Study of Scholastic Influence Network

Measuring the true influence of a researcher over the past few years has been an important problem in the field of scientometrics, as it not only serves funding organizations, academic departments, and researchers but also indicates the impact of scholarly work. Existing author-level ranking metrics, such as the h-index and other measures based on citation counts, can ignore much of the nuance and are often criticised as unfair owing to their purely quantitative approach. In this paper we propose an influence diffusion model based on a variant of the epidemiological model called Recruitment Boosted SEIR. Our model maps the spread of infection onto the growth of a researcher's influence by remodeling various existing parameters and building a new concept for the qualitative study of prolific authors. The reproduction number is then derived and the scores are computed. To validate our influence diffusion model, we perform experiments on a real author dataset collected from Web of Science and compare the researchers' influence with their papers' citation counts and h-index. Finally, we analyse how researchers' influence ranking increased over time. Our studies also show the changing patterns among researchers with different h-index values.


INTRODUCTION
Capturing and measuring the true scholastic influence of an author's work has gained huge momentum owing to the prevailing culture of "publish or perish" in most academic institutions. The major trade-off in this scenario is the quality and relevance of the research work done. Existing bibliometric ranking techniques focus on quantitative entities such as citation counts and papers published, without considering the impact and outreach of these research articles. A lot of work is being done on author influence diffusion techniques using graphical measures like tree structures or graphs. In [1,2] the authors created an influence diffusion tree that uses the multiple-path asynchronous threshold (MAT) model for viral marketing in social networks. They quantitatively measured influence and kept track of its spread and aggregation during the diffusion process. The MAT model captured both direct and indirect influence, depth-associated influence attenuation, temporal influence decay, and individual diffusion dynamics. Another major advancement in this domain has come from epidemiological models. These infection models are found to be efficient in capturing the spread and impact of a disease [3]. Mathematical modeling of infectious diseases is used to predict the transmission and outcome of diseases, which helps to provide possible countermeasures to reduce the mortality rate or to eradicate the diseases. Inspired by such infection modeling, we have considered amoebiasis disease modeling to replicate the scientific growth and impact of a researcher. The populace, or set of articles published by an author, is partitioned into compartments, with the supposition that each person/article in the same compartment has similar attributes. The SEIR model [4] has an additional state, "exposed", which reflects the delay between acquiring the infection and becoming infectious [5,6].
The spread of infection is suitably interpreted as the spread of an author's influence through his/her articles. Existing author-level metrics like the h-index and publication count indicate the quantitative aspect of a researcher's growth [7]. In this paper we compare these existing metrics with our influence scores using the reproduction number [8]. To study and compare the effect of the SEIR influence diffusion model, we have considered authors at various levels and their growth trajectories. The two categories are:

Celebrity Authors: researchers with a higher h-index (greater than 50).
The problem we try to address here is that authors with more citations are labelled influential authors, which makes true scholastic influence increasingly difficult to measure. The true impact of scholarly work can be measured only when the true value of each citation is considered. We are aware that differentiating citations based on their source might seem highly conservative to many, but by filtering the citations at the source we intend to create an unbiased evaluation process. Our study is motivated by the existing gap in scholastic evaluation methods. We propose to discuss and address the following questions:
i. Can citation count alone capture the quality of an author's work? How can we use it for individual diffusion analysis?
ii. Are all citations equal? Which citations indicate the global influence of an author's work?
iii. How do we determine the impact of research to be considered important?
iv. Can h-index and citation count truly justify the impact of an author?
The rest of the paper is organized as follows: the motivation for this work is followed by the related work done so far in this field. The subsequent sections discuss the author influence model in detail, from data collection and data modeling to results and validation.

MOTIVATION
In recent years, the research community has started to explore different and innovative metrics that can define "Scholastic Influence" for scientific and research assessments. Citation analysis is one of the most popular techniques used in scientometrics [9,10]; this method involves counting the number of times a paper is cited. The underlying assumption is that the most influential researchers and their work will naturally receive more citations. A recent trend shows that many researchers have exploited this assumption by collaborating with peer groups or forming a network of known authors who coercively cite each other and eventually boost their citations [11]. There is thus a lacuna in existing Author Level Metrics (AL Metrics), such as the h-index and i-10, that measure the impact of an author from citation counts [12]. The main problem with these approaches is that the metrics are susceptible to artificial inflation of citation counts through community citation, self-citation, etc. [6], which could cause them to show incorrect or misleading data about the author's influence. An author could then be considered substantially impactful or influential even though that may not be true [13]. Using the epidemiological model, we propose to overcome this drawback by considering other factors, such as the source of the citations received, to better predict the influence of an author and to reveal the epidemic nature of authors' influence.

RELATED WORK
Many researchers have tried to rank authors based on various metrics. In [14] the authors used a rating approach that uses citations in a dynamic fashion, allocating ratings by considering the relative position of two authors at the time of citation. The main objective of that paper was to introduce the notion of citation timing for relative ranking among authors. The authors of [4] proposed a novel epidemic model, called the CISER model, for message propagation in DTN, based on amoebiasis disease propagation in a human population. The paper discusses in detail the various analogies of DTN and how the CISER model is instrumental in message propagation in many such networks. The authors indicated the role of each compartment in the epidemiological model, with derivations for each state using differential equations. In another paper [15], topic diffusion in web forums is modeled using an epidemiological model. The SIR model was adopted for the web forum and evaluated on a large longitudinal dataset from the web forum of a major retail company and a dataset from a general political discussion forum. The experimental results revealed that the SIR model performed well in modeling topic diffusion in web forums [16]. Many researchers have also worked on scientometrics; in [17], statistical characteristics of authors, co-authors, references, and citations are used to reveal the structure and dynamics of the research community and the intellectual environment of the field. The paper [18] discusses scientometrics as the quantitative (mainly statistical) study of any measurable aspect of scientific activity, with the aim of understanding and, if possible, improving its operating mechanism. The authors divide scientometrics into two branches. Structural scientometrics maps the structure of scientific communities, sets of documents, ideas, etc. [19]; its typical techniques are, among others, graph theory, network analysis and cluster analysis [3]. Dynamic scientometrics describes the space-time behavior of scientific information through scientometric objects (authors, publications, citations, etc.); its typical methodological tools are ordinary and partial differential equations, stochastic models, and computer simulations.

PROBLEM STATEMENT AND OBJECTIVE
As explored in the previous section, there are numerous metrics to measure the influence of an author in the public domain. However, these metrics rarely reflect the influence of a researcher's work. It is not beyond doubt that such metrics are effective universally, but a metric that describes scholastic influence would be handy for agencies evaluating researchers outside their peer groups and may further facilitate research assessment for outstanding credentials. We posit that an author may thrive either by receiving citations from pure strangers (aliens) or by depending on the support of an ecosystem of known collaborators. In either case, we are not judgmental of the author or their citations. Nonetheless, there is a strong case for evaluating how far an author's influence spreads beyond known boundaries. This is a rare quality that science academics and agencies look for. We intend to bridge existing gaps in scholarly evaluation by addressing the following questions in our experimental study:
i. Given that field normalization has been done across research domains in computer science, minimizing the possibility of skewness/bias in citations, can we define a parameter that computes the extent of outreach of research scholars?
ii. Is there any way, other than the existing ones, to determine the potential for independent research in the early career of scientists?
iii. Can we qualitatively measure the spread and growth of a researcher given the citation data?
We answer these questions positively in the manuscript via a theoretical model and empirical evidence. Citation count is often used as a metric to rank authors and their work. We propose a new approach as a natural consequence of such an exercise: we trace the source of citations and calculate the various parameters by deriving formulas and equations for the proposed epidemiological SEIR model. Epidemiological models are deterministic models that can be suitably modified to explore the dynamics of citation flow in an author's network. The epidemiological models have the following variants:
i. The SIR model is an epidemiological model that studies the rate of infected people in a closed population over a fixed time. The model has three compartments, namely S, the number susceptible; I, the number infectious; and R, the number recovered (immune).
ii. The SIS model has only two states, susceptible and infected.
iii. The SEIR model has an additional state called Exposed, which indicates that the infection is local and not yet in the infectious state.
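For reference, the textbook form of the SEIR system can be written as below. This is the standard model, not the recruitment-boosted variant developed in this paper, and the symbols are the usual textbook ones (an assumption here): β is the transmission rate, σ the rate of leaving the exposed class, γ the recovery rate and N the total population. They should not be confused with the α, β, γ and µ coefficients defined later for the citation model.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\frac{\beta S I}{N}, &
\frac{dE}{dt} &= \frac{\beta S I}{N} - \sigma E, \\
\frac{dI}{dt} &= \sigma E - \gamma I, &
\frac{dR}{dt} &= \gamma I .
\end{aligned}
```

Summing the four right-hand sides gives zero, so the total S + E + I + R stays constant in the textbook model.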
For our problem domain, we selected the SEIR model to track the movement of citations based on the source of the citing articles.
Initially all articles published by an author are considered susceptible. Based on the type of citation received, an article transitions from S to I or from S to E. An article is removed when it receives no citations for three consecutive years. The next section explains the model formulation.
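The state rules in this paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Citation` record, its `is_local` flag and the function `classify_article` are hypothetical names, and the removal rule is read as "no citation in the last three years, counted from the most recent citation or from the publication year".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Citation:
    year: int       # year in which the citing article appeared
    is_local: bool  # hypothetical flag: True if the citing author is from the
                    # same institute/community, False for an "alien" author


def classify_article(pub_year: int, citations: List[Citation],
                     as_of_year: int, idle_years: int = 3) -> str:
    """Return 'S', 'E', 'I' or 'R' for one article as of a given year."""
    # Only citations that exist by the evaluation year are considered.
    seen = [c for c in citations if c.year <= as_of_year]
    # ASSUMPTION: inactivity is measured from the latest citation, or from
    # the publication year when the article has never been cited.
    last_active = max((c.year for c in seen), default=pub_year)
    if as_of_year - last_active >= idle_years:
        return "R"  # removed: no citations for three consecutive years
    if any(not c.is_local for c in seen):
        return "I"  # infected: cited by at least one alien author
    if seen:
        return "E"  # exposed: only local citations so far
    return "S"      # susceptible: published, not yet cited


if __name__ == "__main__":
    history = [Citation(2011, True), Citation(2012, False)]
    for year in range(2010, 2016):
        print(year, classify_article(2010, history, year))
```

Counting how many articles change label from one year to the next gives the year-wise transition counts that the model coefficients are averaged from.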

Data collection and Modeling
The complete methodology adopted is shown in Figure 1. One of the major tasks was to collect data on various authors and store it for further analysis. The data was collected from Web of Science using the advanced search option for authors in the field of computer science.
Next, using the filter option, we set the years of publication to 2010-2017 for each author. The downloaded author data was in comma-separated values (CSV) format and contained the list of all the papers belonging to an author along with the citations each paper received, as shown in Figure 2.
The citation report option gave further detail about the cited paper of the author under consideration, the author of the citing paper, the university that author belonged to and the year of publication, along with the h-index and total publications. We prepared a list of 35 authors with various h-index ranges (less than 10, greater than 50) and downloaded data accordingly. The data collection steps are as follows:
1. In the first step we select an author 'i' whom we want to analyze.
2. In the second step we take the data of the articles citing the work of author 'i' and call the set of all these articles 'Citing Articles'.
3. In the third step we segregate the citations received year wise and compile the list of author-citation data year wise as shown in Table 1.
4. Next, each citation is further investigated based on its source. We calculate how many papers of author 'i' go from one state to another, i.e., Susceptible to Exposed, Susceptible to Infected, Susceptible to Recovered, Exposed to Infected, Exposed to Recovered and Infected to Recovered. Each of these states is described in the next section and holds the same meaning.
5. In the final step we calculate the values of the various coefficients by taking the average of each of the transitions.

Transition from Original SEIR Model to Our Influence Diffusion Model
Scientometrics is often considered as the quantitative study of any measurable aspect of scientific activity (mainly those reflected in the scientific literature), with the aim of understanding and, if possible, improving its operating mechanism.
There are two variations proposed in [20]:
1. Structural scientometrics. Its purpose is the mapping of the structure of scientific communities, sets of documents, cognitive ideas, etc. Its typical techniques are, among others, graph theory, network analysis and cluster analysis.
2. Dynamic scientometrics. Its purpose is to describe the space-time behavior of scientific information through scientometric objects (authors, publications, citations, etc.). Its typical methodological tools are ordinary and partial differential equations, stochastic models, and computer simulation.
Epidemiological models in scientometrics have been used to describe the dynamic behavior of authors and citations [20,21]. In our approach, we have concentrated on the SEIR model variant, because the extended class 'Exposed' suitably models our requirement of segregating citations based on their source. The flowchart of the SEIR model is given in Figure 3. The ovals in the flowchart represent the different compartments of the infection model, namely Susceptible (S), Exposed (E), Infected (I) and Removed (R). The transitions from one class of nodes to another are represented by arrows, and the rates of transition are indicated.
Next, the data collected as discussed in the previous section is fed into our model as shown in Figure 4, with each state defined as follows:
1. Susceptible (S): The susceptible class represents the set of articles published by an author in a particular year.
2. Exposed (E): The exposed class represents the set of articles which have received only local citations, that is, citations from authors belonging to the same institute/organization or community as the published author.
3. Infectious (I): The infected class represents the set of articles which have been cited by alien authors, meaning authors not belonging to the same institute or organization.
4. Removal (R): The recovered class represents the set of articles which were not cited by any author for a period of three years.
The size of the above classes with respect to the proportions of the network can be expressed as follows:

Vital Dynamics of SEIR Model: Parameters and Equations
Once the structural model was ready, we describe the dynamic behavior of our model with the parameters redefined. For the citation data collected, the initial condition (t = 0) is set to the year 2010 and the time span is 8 years, that is, until 2017. The dynamics of the SEIR model for author citation can be represented using a set of nonlinear differential equations based on the Initial Value Problem (IVP), in the following general form (2), where the state vector defines the size of the composition of the different classes of nodes. The size of the above classes with respect to the proportions of the network can be expressed as in equation (2), where N is the total number of nodes in the network. Let 1/λ, 1/γ and 1/µ be the average periods of time a node remains in the Susceptible, Exposed and Infected classes, respectively. Assume λ to be the infection rate at which susceptible nodes acquire the infection. Assume that p is the probability for an exposed paper to become an infected paper (if cited by an alien author). Then (1-p) is the probability that an exposed paper becomes recovered (not cited for a period of 3 consecutive years). We also assume that no new papers are written from the time the citations are being counted. A complete list of all the parameters used in the equations is given in Table 2.
1. The published articles of a researcher start in the susceptible state. Any citation received from the author's local network moves an article to the Exposed class at a rate β. Similarly, the portion of susceptible papers cited by alien authors moves to the Infected class at a rate α. Certain papers may not be cited at all for a period of 3 years after publication; these papers move to the Recovered class at a rate ε. Figure 4 shows the complete set of transitions. Using the principle of the law of mass action, the rate of change in the size of the susceptible class can be represented by differential equation (3).
2. The susceptible papers that move to the Exposed class remain there for a period of duration 1/γ. The next transition is either to the Infected state at a rate proportional to γ with probability p, or, for papers not cited for 3 consecutive years, to the Recovered class at a rate proportional to (1-p)/γ. It follows that the equation expressing the rate of change in size of the exposed proportion of the population is:
3. The average life of infected papers is taken as 1/µ, so the rate at which a paper goes from the Infected class to the Recovered class is µ. The equation for the rate of change in size of the infected class is given by:
4. As described earlier, if a paper in any class has not been cited for 3 consecutive years, it moves to the Recovered state. 'cR' is the rate at which papers are removed from the model. The differential expression for the Recovered state is given by:
The probability of an exposed paper going to the Infected class has been assumed to be a small value of 0.001. The recruitment rate is equal to the rate at which papers leave the Recovered state R. We assume that this is periodic in nature.
These four ordinary differential equations, coupled with the flowchart, form a system that governs the dynamics of author citation. A solution to this system is a vector function that provides, at any time t, the coordinates of a point in four-dimensional space whose components are the sizes of the susceptible, exposed, infected and recovered classes, respectively. Let the sum of the 4 ordinary differential equations be given by S0. From the system it follows that:

$$S_0 = \frac{dS}{dt} + \frac{dE}{dt} + \frac{dI}{dt} + \frac{dR}{dt} = 0 \quad\Rightarrow\quad S(t) + E(t) + I(t) + R(t) = k$$

where k is the constant of integration, which is equal to 1 at the initial time t = 0. Epidemiologically, we can conclude that the size of the population under consideration does not change during the whole period of author citations.
All states of our model are now defined with equations and the model is balanced as well.

REPRODUCTION NUMBER: AN INDICATOR FOR INFLUENCE DIFFUSION
In the literature on infectious diseases, the basic reproduction number R0 is the determinant parameter for the spread of a disease. One of the most important uses of R0 is determining whether an emerging infectious disease can spread in a population. This property is suitably used in our model to simulate and measure the spread of scholastic influence. As far as the threshold condition for influence propagation is concerned, the basic reproduction number must be greater than 1 for influence to spread, i.e., R0 > 1; otherwise influence propagation will die off.
The derivation of R0 for our model begins by identifying the various factors affecting the spread of influence. The spread of an author's influence depends on the average rate of citations received from alien sources, the duration of the infection and the transmission rate. More specifically, this is modelled as:

$$R_0 = \tau \cdot c \cdot d \qquad (12)$$

where τ is the transmissibility (i.e., the probability of infection given contact between a susceptible and an infected individual), c is the average rate of contact between susceptible and infected individuals, and d is the duration of infectiousness. We can now define the variables on the RHS of equation (12) in terms of the various parameters shown in Table 2 as follows. The probability of infection given that contact happens is found by adding and then taking the inverse of the rates of the two ways a paper can get infected, that is, by going from the Susceptible to the Infected state and from the Exposed to the Infected state, as shown in equation (13). For the rate of contact between susceptible and infected individuals we take the average of the rates of Susceptible to Infected, Exposed to Infected and Susceptible to Exposed, as shown in equation (14):

$$c = \frac{\beta + \alpha + \gamma}{3} \qquad (14)$$

We also consider Susceptible to Exposed, as getting exposed increases a paper's chance of getting infected.
We now use Equations 13, 14 and 15 to derive the reproduction number R0. Therefore, R0 for our model is given as:

RESULTS OBTAINED AND VALIDATION
We construct our SEIR influence diffusion model for each of the 35 authors. The citations of each author are investigated, and the states and transitions are assigned accordingly. The values of the coefficients α, β, γ and µ are derived using Algorithm 1.
For example, as shown in Table 3, for an author with 24 papers we carefully analyze the movements per paper/citation and year wise. We count, year wise, the number of papers moving from the susceptible state to the infected state. The value of α is obtained by dividing the total number of papers that moved from the susceptible to the infected state by the number of years. So, the value obtained is 24/8 = 3. Similarly, all other coefficients are derived and stored in the CSV file as shown in Table 4.
The reproduction number is found to be unaffected by the number of papers, which indicates that a higher publication count and h-index do not ensure that a researcher's influence will be high. The R0 values in Table 5 are consistent with the ideology of the model, which is built to place authors at higher ranks if their eminence is accomplished without the aid of external influence or collaboration of any kind. Certainly, Friedman M and Hawking J deserve the top 2 positions; they are very well recognized researchers in the computer science domain. It is remarkably interesting to explore the R0 values attained by Dane D and Olivier T. They have reasonably high R0 scores since their infection rates are higher than those of many authors with a higher h-index. Also, for author Buyya R the number of papers is 299 and R0 is 169, indicating that his work is influential and epidemic in nature. On the other hand, Alkydiz K has a high h-index and R0 > 1, indicating an epidemic yet less influential profile. Table 6 highlights a few authors whose R0 value is compared with h-index and number of papers published. It is interesting to note that author Olivier T has just 3 publications but a high R0, owing to a high infection rate and a low removal rate. Such authors are suitably termed "Rising stars".
The SEIR curves are plotted for various authors with high h-index and low h-index, and a comparative study of infection rate versus h-index is performed. Figures 5 and 6 clearly indicate that the infection rate of both authors is high, as many citations were received from outside the community. The reproduction number is a clear indication of the spread of infection, and here it is independent of the h-index or the publication count. The rate of infection is higher in these authors, and they are called super spreaders. Our analysis using the SEIR model shows that even researchers with a low publication rate can have a high infection rate and become influential. Figure 7 clearly shows that there is no relation between R0 and h-index or number of papers published. Our model appropriately indicates that the influence of an author must be measured qualitatively.
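The coefficient averaging and the three R0 factors described in the text can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' Algorithm 1: the names `average_rate`, `Coefficients` and `reproduction_number` are hypothetical, τ = 1/(α + γ) is a literal reading of "adding and then taking the inverse" of the two infection rates, d = 1/µ is assumed from the stated average life of an infected paper, and the β, γ and µ values in the usage lines are made up for the example.

```python
from dataclasses import dataclass


def average_rate(total_transitions: int, years: int) -> float:
    """Average number of papers making a given transition per year."""
    if years <= 0:
        raise ValueError("years must be positive")
    return total_transitions / years


@dataclass(frozen=True)
class Coefficients:
    alpha: float  # Susceptible -> Infected rate
    beta: float   # Susceptible -> Exposed rate
    gamma: float  # Exposed -> Infected rate
    mu: float     # Infected -> Recovered rate


def reproduction_number(k: Coefficients) -> float:
    """R0 = tau * c * d, with the three factors built from the coefficients."""
    # ASSUMPTION: transmissibility is the inverse of the summed infection rates.
    tau = 1.0 / (k.alpha + k.gamma)
    # Contact rate: average of the S->I, S->E and E->I rates, as in the text.
    c = (k.alpha + k.beta + k.gamma) / 3.0
    # ASSUMPTION: duration of infectiousness is the average infected life 1/mu.
    d = 1.0 / k.mu
    return tau * c * d


if __name__ == "__main__":
    # 24 papers moved from S to I over the 8-year window 2010-2017.
    alpha = average_rate(24, 8)
    # beta, gamma and mu below are illustrative values, not from the paper.
    coeffs = Coefficients(alpha=alpha, beta=2.0, gamma=1.0, mu=0.5)
    r0 = reproduction_number(coeffs)
    print(f"alpha = {alpha}, R0 = {r0}")
    print("influence spreads" if r0 > 1 else "influence does not spread")
```

Under these assumptions, R0 grows when papers leave the Infected class slowly (small µ) and does not depend on the raw number of papers, in line with the observation that R0 is independent of publication count.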

CONCLUSION
The field of bibliometrics has developed remarkably over the decades, and researchers are now dealing with tougher questions. A common assumption is that all citations received by a publication have the same value, when some citing publications clearly have a different impact than others. The question becomes how such a convention could be set aside, making way for a new paradigm. Our effort is to clearly demarcate citations from within the community and citations from outside it. These citations are mapped to the exposed and infected states, respectively. The mathematical modelling of the transmission of infection is used to create a new model that treats "influentiality" as a parameter for observing an author's growth. The work focuses on the quality of citations rather than the number of citations received. This ensures that every researcher's work is weighed equally, without bias, and that quality is paramount. The contribution of this paper could have several consequences:
1. It could help identify suitable candidates for national and international research awards based on R0 scores.
2. It could help determine the interdisciplinary influence of a scholar.