Statistical Indicators of the Scientific Publications Importance: a Stochastic Model and Critical Look

A model of scientific citation distribution is given. We apply it to understand the role of the Hirsch index as an indicator of scientific publication importance in Mathematics and some related fields. The proposed model is based on a generalization of such well-known distributions as the geometric and Sibuya laws, which are now included in a single family of distributions. An analysis of real data on the Hirsch index and the corresponding citation numbers is given.


Introduction
A rather large number of indexes have been proposed in the literature that supposedly measure the significance of an author's scientific publications. Among the most popular of them are: i1) the total number of citations of a particular author; i2) the Hirsch index of the author.
It is these two indexes that we consider in the proposed work.
The definition of the numerical value of the index i1) is clear from its name.
Recall the definition of the Hirsch index (see [1]). The Hirsch index h of an author is the largest number h such that h of the author's articles have been cited at least h times each. This index was introduced in [1], where its properties were explained. In our opinion, these properties do not correspond to the purpose of the index. However, we postpone the description of both the positive and negative sides of the Hirsch index until after constructing citation models for scientific articles. One of them has already been stated by us in the preprint [2].

Citation model construction
We now turn to the construction of the author's citation model. It is treated as a composition of two models. The first describes the process by which an author publishes an article that will be cited, and the second describes the process of citing such an article.
Let us make some assumptions, which we discuss later.
Assumption 2.1. Let the probability that a manuscript is rejected or never cited be q, and let the decisions on the publication of different manuscripts be taken independently.
Then it is clear that the probability that the scientist will have k cited papers equals $q(1-q)^k$, $k = 0, 1, \ldots$. In other words, the random number of cited publications of a scientist has a geometric distribution with parameter q. The generating function of this distribution has the form
$$P(z) = \frac{q}{1-(1-q)z}. \qquad (2.1)$$
Of course, here we assume that all journals to which the author sends manuscripts have the same review system, i.e. all of them accept the manuscripts of this author with the same probability 1 − q. More realistic is the situation with a random parameter q:
$$P(z) = \int_0^1 \frac{q}{1-(1-q)z}\, d\Xi(q),$$
where Ξ is a probability distribution on the interval [0, 1].
It is natural to assume that each cited publication will produce some number of citations. Of course, the likelihood that the article will be quoted again depends on the number of previous citations.
Assumption 2.2. Assume that the probability that an article having k − 1 (k ≥ 1) citations will receive no new citations equals $p/k^{\gamma}$, where p is the probability that the article will not be quoted for the first time.
Consequently, the probability that the article will be quoted exactly k times equals $\frac{p}{k^{\gamma}}\prod_{j=1}^{k-1}\left(1 - p/j^{\gamma}\right)$. For the case γ = 1, the probability generating function for the number of citations of this article is $1 - (1-z)^p$. The corresponding distribution is named after Sibuya. Below we consider the case of arbitrary positive γ. The corresponding study is of general mathematical interest. Therefore, we present it in a number of sections below.
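The probabilities in Assumption 2.2 are easy to tabulate numerically. The following Python sketch (the function names `citation_pmf` and `pgf_from_pmf` and the parameter values are ours and purely illustrative) computes $\frac{p}{k^{\gamma}}\prod_{j<k}(1-p/j^{\gamma})$ recursively and, for γ = 1, compares the truncated generating function with the closed form $1-(1-z)^p$.

```python
def citation_pmf(k_max, p, gamma):
    """Return [P{X=1}, ..., P{X=k_max}] for stopping probabilities p / k**gamma."""
    probs = []
    survive = 1.0  # prod_{j<k} (1 - p / j**gamma): the process has not stopped before step k
    for k in range(1, k_max + 1):
        stop = p / k ** gamma
        probs.append(stop * survive)
        survive *= 1.0 - stop
    return probs


def pgf_from_pmf(probs, z):
    """Truncated generating function sum_k P{X=k} * z**k (k starts at 1)."""
    return sum(pk * z ** k for k, pk in enumerate(probs, start=1))


if __name__ == "__main__":
    p, z = 0.5, 0.5  # illustrative values, not taken from the data
    approx = pgf_from_pmf(citation_pmf(200, p, 1.0), z)
    exact = 1.0 - (1.0 - z) ** p  # closed form for gamma = 1
    print(approx, exact)
```

With γ = 0 the same function returns the geometric probabilities $p(1-p)^{k-1}$, so both classical cases are covered by one routine.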

Distribution of citation number of a paper
Let us consider an ordered sequence of experiments $\{E_n;\ n = 1, 2, \ldots\}$, where an event A may appear in the n-th experiment with probability $p_n$. Define a random variable X as the number of the first experiment in which A appears. We suppose that X is an improper random variable in the sense that it may take an infinite value (that is, the event A never appears). If $\mathbb{P}\{X = \infty\} = 0$, we say that X is a proper random variable. It is clear that
$$\mathbb{P}\{X = n\} = p_n \prod_{j=1}^{n-1}(1-p_j), \quad n = 1, 2, \ldots \qquad (3.1)$$
and
$$\mathbb{P}\{X = \infty\} = \prod_{j=1}^{\infty}(1-p_j).$$
Particular cases are:
1. The probabilities $p_n = p$ are constant. Then (3.1) corresponds to the classical geometric distribution. Its tail is
$$T_m = \mathbb{P}\{X \ge m\} = (1-p)^{m-1}.$$
Clearly, both the tail and the probabilities decrease exponentially as n tends to infinity.
2. The probabilities are $p_n = p/n$, where p is a number from the interval (0, 1). Equation (3.1) is transformed to
$$\mathbb{P}\{X = n\} = \frac{p}{n}\prod_{j=1}^{n-1}\left(1-\frac{p}{j}\right), \quad n = 1, 2, \ldots$$
In this case X is a proper random variable and has the Sibuya distribution with parameter $p \in (0, 1)$, with the tail
$$T_m = \mathbb{P}\{X \ge m\} = \frac{\Gamma(m-p)}{\Gamma(1-p)\,\Gamma(m)} \sim \frac{m^{-p}}{\Gamma(1-p)},$$
which has heavy power asymptotics as m → ∞. Such a distribution does not have a finite mean. It is not difficult to see that
$$\mathbb{P}\{X = n\} \sim \frac{p}{\Gamma(1-p)}\, n^{-(1+p)}, \quad n \to \infty.$$
The two distributions presented can be regarded as extreme points from the perspective of the tail behavior of a proper random variable X. Hence, it is natural to study, roughly speaking, the cases lying between them; namely, to consider, for example, the situations when $p_n = p/n^{\gamma}$ with $p \in (0, 1)$ and γ > 0.
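The contrast between the two extreme cases can be made concrete with a short numerical sketch. It assumes the tail convention $T_m = \mathbb{P}\{X \ge m\}$, so that the geometric tail is $(1-p)^{m-1}$ and the Sibuya tail is $\Gamma(m-p)/(\Gamma(1-p)\Gamma(m))$; the function names are illustrative.

```python
import math


def geometric_tail(m, p):
    """T_m = P{X >= m} for constant p_n = p."""
    return (1.0 - p) ** (m - 1)


def sibuya_tail(m, p):
    """T_m = P{X >= m} = Gamma(m - p) / (Gamma(1 - p) * Gamma(m)) for p_n = p / n."""
    # Work with log-gamma to avoid overflow for large m.
    return math.exp(math.lgamma(m - p) - math.lgamma(1.0 - p) - math.lgamma(m))


def power_law_approx(m, p):
    """Leading-order asymptotics m**(-p) / Gamma(1 - p) of the Sibuya tail."""
    return m ** (-p) / math.gamma(1.0 - p)


if __name__ == "__main__":
    p = 0.5  # illustrative value
    for m in (10, 100, 10_000):
        print(m, geometric_tail(m, p), sibuya_tail(m, p), power_law_approx(m, p))
```

Already for moderate m the geometric tail is negligible, while the Sibuya tail decays only like $m^{-p}$.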

Main result on citation number distribution
The subject of this section is the asymptotic behavior of the probabilities (3.1) for $p_n = p/n^{\gamma}$ with γ ≥ 0. In addition to the values γ = 0 and γ = 1 discussed earlier, we distinguish the following two cases: A) 0 < γ < 1; B) γ > 1. Let us consider the case A). We have
$$\mathbb{P}\{X = n\} = \frac{p}{n^{\gamma}}\prod_{k=1}^{n-1}\left(1-\frac{p}{k^{\gamma}}\right). \qquad (4.1)$$
Consider the product on the right-hand side of (4.1) in more detail.
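Expanding the logarithm of each factor in a power series, we obtain
$$\prod_{k=1}^{n-1}\left(1-\frac{p}{k^{\gamma}}\right) = \exp\Bigl\{-\sum_{j=1}^{\infty}\frac{p^j}{j}\sum_{k=1}^{n-1}\frac{1}{k^{\gamma j}}\Bigr\}.$$
For γj > 1 the inner sums converge as n → ∞, so the factor
$$\exp\Bigl\{-\sum_{j > 1/\gamma}\frac{p^j}{j}\sum_{k=1}^{n-1}\frac{1}{k^{\gamma j}}\Bigr\}$$
has a finite positive limit as n → ∞. This limit may depend on p and γ. Let us denote it by $C_1 = C_1(\gamma, p)$. Therefore,
$$\mathbb{P}\{X = n\} \sim C_1\,\frac{p}{n^{\gamma}}\exp\Bigl\{-\sum_{1 \le j \le 1/\gamma}\frac{p^j}{j}\sum_{k=1}^{n-1}\frac{1}{k^{\gamma j}}\Bigr\}, \quad n \to \infty.$$
For 0 < γj < 1 the following asymptotic representation is known:
$$\sum_{k=1}^{n-1}\frac{1}{k^{\gamma j}} = \frac{n^{1-\gamma j}}{1-\gamma j} + \zeta(\gamma j) + O\bigl(n^{-\gamma j}\bigr),$$
where ζ(u) is the Riemann zeta function. Further considerations depend on some properties of the number γ.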

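If 1/γ is not a positive integer, then
$$\mathbb{P}\{X = n\} \sim \frac{C_4}{n^{\gamma}}\exp\Bigl\{-\sum_{j=1}^{[1/\gamma]}\frac{p^j\, n^{1-\gamma j}}{j(1-\gamma j)}\Bigr\}, \quad n \to \infty,$$
where [1/γ] is the integer part of 1/γ and $C_4$ depends on p and γ. Similarly, for the case of integer 1/γ, the term with j = 1/γ contributes a logarithm, and
$$\mathbb{P}\{X = n\} \sim \frac{C_5}{n^{\gamma(1+p^{1/\gamma})}}\exp\Bigl\{-\sum_{j=1}^{1/\gamma-1}\frac{p^j\, n^{1-\gamma j}}{j(1-\gamma j)}\Bigr\}, \quad n \to \infty.$$
Let us consider the case B). We have
$$\mathbb{P}\{X = n\} = \frac{p}{n^{\gamma}}\prod_{k=1}^{n-1}\left(1-\frac{p}{k^{\gamma}}\right),$$
where γ > 1. Transform the product on the right-hand side:
$$\prod_{k=1}^{n-1}\left(1-\frac{p}{k^{\gamma}}\right) = \exp\Bigl\{-\sum_{j=1}^{\infty}\frac{p^j}{j}\sum_{k=1}^{n-1}\frac{1}{k^{\gamma j}}\Bigr\}.$$
The series under the exponential sign converges as n → ∞ because γ > 1. From the latter relation we see that
$$\mathbb{P}\{X = \infty\} = \prod_{k=1}^{\infty}\left(1-\frac{p}{k^{\gamma}}\right) > 0$$
and X is an improper random variable. Therefore, for the conditional probabilities we have
$$\mathbb{P}\{X = n \mid X < \infty\} \sim C_6\,\frac{p}{n^{\gamma}} \quad \text{as } n \to \infty, \qquad (4.14)$$
where $C_6$ depends on p and γ only. Summarizing, we obtain the following theorem.
Theorem 4.1. For the considered scheme of experiments with probabilities given in (4.1) the following statements are true:
• If γ = 0 then $\mathbb{P}\{X = n\} = p(1-p)^{n-1}$, n = 1, 2, ....
• If 0 < γ < 1 and 1/γ is not an integer then $\mathbb{P}\{X = n\} \sim C_4\, n^{-\gamma}\exp\bigl\{-\sum_{j=1}^{[1/\gamma]} p^j n^{1-\gamma j}/(j(1-\gamma j))\bigr\}$ as n → ∞.
• If 0 < γ < 1 and 1/γ is an integer then $\mathbb{P}\{X = n\} \sim C_5\, n^{-\gamma(1+p^{1/\gamma})}\exp\bigl\{-\sum_{j=1}^{1/\gamma-1} p^j n^{1-\gamma j}/(j(1-\gamma j))\bigr\}$ as n → ∞.
• If γ = 1 then X has the Sibuya distribution and $\mathbb{P}\{X = n\} \sim p\, n^{-(1+p)}/\Gamma(1-p)$ as n → ∞.
• If γ > 1 then X is improper and $\mathbb{P}\{X = n \mid X < \infty\} \sim C_6\, p/n^{\gamma}$ as n → ∞.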
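The dichotomy between a proper X (γ ≤ 1) and an improper X (γ > 1) can be illustrated numerically by truncating the infinite product $\prod_{j \ge 1}(1 - p/j^{\gamma})$, which is the probability that the event A never occurs. The sketch below is only an illustration; the function name and the truncation level are our assumptions. For γ = 2 the classical Euler product $\sin(\pi x)/(\pi x) = \prod_{j \ge 1}(1 - x^2/j^2)$ with $x = \sqrt{p}$ gives a closed form to compare with.

```python
import math


def prob_never(p, gamma, terms=200_000):
    """Truncated product prod_{j=1}^{terms} (1 - p / j**gamma), approximating P{X = infinity}."""
    prod = 1.0
    for j in range(1, terms + 1):
        prod *= 1.0 - p / j ** gamma
    return prod


if __name__ == "__main__":
    p = 0.5  # illustrative value
    x = math.sqrt(p)
    # gamma = 2: the product converges to sin(pi x) / (pi x) > 0, so X is improper.
    print(prob_never(p, 2.0), math.sin(math.pi * x) / (math.pi * x))
    # gamma = 1: the truncated product keeps shrinking towards 0, so X is proper.
    print(prob_never(p, 1.0))
```

For γ = 1 the truncated product decays like a power of the truncation level, so it approaches zero only slowly.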

Comments
Theorem 4.1 shows that for 0 ≤ γ < 1 the tail of the corresponding distribution is not heavy. Namely, the distribution has finite moments of all positive orders. However, the tail becomes heavier as γ ∈ [0, 1) grows. In the case of γ ∈ [0, 1] the distribution is unimodal with mode equal to 1. For γ ∈ [1, ∞), the distribution has a power-type tail, which is heavier than the ones occurring for γ ∈ [0, 1). In the case γ ∈ [1, 2) the conditional distribution under the condition X < ∞ does not have a finite mean. However, as γ ∈ [1, ∞) grows, the tails of the conditional distributions become less heavy. In the case of γ ∈ [1, ∞) the conditional distribution has its mode at 1.

The case of growing $p_n$
Above, we considered the case of the probability of event A decreasing with increasing experiment number. For completeness, consider the case of an increase of this probability. Namely, suppose that in (3.1) $p_n = 1 - q/n^{\gamma}$ for $q \in (0, 1)$ and γ > 0. Then
$$\mathbb{P}\{X = n\} = \left(1-\frac{q}{n^{\gamma}}\right)\frac{q^{n-1}}{(\Gamma(n))^{\gamma}}, \quad n = 1, 2, \ldots$$
It is clear that $\mathbb{P}\{X = \infty\} = 0$, and the tail of the distribution $T_m = q^{m-1}/(\Gamma(m))^{\gamma}$ is a quickly decreasing function of m. Of course, the distribution of X has finite moments of all orders, and its mode is not necessarily at 1.
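That the mode need not be at 1 when $p_n = 1 - q/n^{\gamma}$ can be seen from a small computation. The sketch below assumes the closed form $\mathbb{P}\{X = n\} = (1 - q/n^{\gamma})\,q^{n-1}/(\Gamma(n))^{\gamma}$, which follows from the product formula for the first occurrence of A; the function names and parameter values are illustrative.

```python
import math


def growing_pmf(n, q, gamma):
    """P{X = n} = (1 - q / n**gamma) * q**(n - 1) / Gamma(n)**gamma for p_n = 1 - q / n**gamma."""
    return (1.0 - q / n ** gamma) * q ** (n - 1) / math.gamma(n) ** gamma


def mode(q, gamma, n_max=40):
    """Location of the largest probability among n = 1, ..., n_max."""
    return max(range(1, n_max + 1), key=lambda n: growing_pmf(n, q, gamma))


if __name__ == "__main__":
    print(mode(0.3, 1.0))  # small q: the first experiment already succeeds with high probability
    print(mode(0.9, 1.0))  # q close to 1: the first experiment rarely succeeds
```

For q = 0.9 and γ = 1 the first two probabilities are 0.1 and 0.495, so the mode moves away from 1.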

Back to the distribution of citation number of one author
We suppose now that the distribution of the citation number of one paper has the form (4.1):
$$\mathbb{P}\{X = k\} = \frac{p}{k^{\gamma}}\prod_{j=1}^{k-1}\left(1-\frac{p}{j^{\gamma}}\right), \quad k = 1, 2, \ldots$$
with γ > 0. The corresponding probability generating function is
$$Q(z) = \sum_{k=1}^{\infty} z^k\, \frac{p}{k^{\gamma}}\prod_{j=1}^{k-1}\left(1-\frac{p}{j^{\gamma}}\right).$$
As mentioned above, the number of cited papers is distributed according to the geometric law with probability generating function (2.1):
$$P(z) = \frac{q}{1-(1-q)z}.$$
The probability generating function of the citation number of one author equals the composition of P and Q, i.e. it is P(Q(z)). It is clear that the tail of the corresponding distribution is not heavy for γ ∈ [0, 1), it is heavy for γ = 1, and the distribution is improper for γ > 1.
Although the case of an improper distribution seems unrealistic, we discuss it for some particular cases below, after considering the proper cases γ ∈ [0, 1].
Recall that the case γ ∈ (0, 1) leads to light-tailed distributions, while γ = 1 leads to laws with a heavy tail. The choice between models with light or heavy tails can only be made based on real data. Below we analyze some data of this kind.

Analyzing data from Google Scholar "Mathematics"
Let us give the data for the section "Mathematics" as of February 16, 2020 (see Table 7.1). The data concern the ten authors with the largest numbers of citations. We do not give the names of these scientists. Table 7.1 shows that the first scientist has 2.76 times more citations than the second. In other words, the maximum of the observations is essentially greater than the second largest one. This observation leads us to think that the corresponding distribution has heavy tails (see [3] and [4]). As we have seen, within our model this is possible for γ = 1 only. Because the sample size is limited, the same behavior is also possible, as an approximation, for γ close to 1 (but less than 1).

Analyzing data from Google Scholar "Biostatistics"
Let us give the data for the section "Biostatistics" as of February 16, 2020 (see Table 7.2). The structure of Table 7.2 is the same as that of Table 7.1. Table 7.2 shows that the first scientist has 1.59 times more citations than the second. Although this ratio is not as large as in Table 7.1, it is large enough to support our hypothesis on the presence of a heavy tail.
We do not give the data for the section "Statistics", but mention that the situation is similar to that of Tables 7.1 and 7.2.

Final model for the distribution of citations
From the considerations of the two previous subsections, it follows that the most natural way to describe the distribution of citations is to choose γ = 1. This means
$$Q(z) = 1-(1-z)^p,$$
and the probability generating function of the citation distribution has the form
$$R(z) = P(Q(z)) = \frac{q}{q+(1-q)(1-z)^p}.$$
Denote by Y the number of citations of a given scientist. It is clear that $\mathbb{P}\{Y = n\}$ may be found as the n-th coefficient of the expansion of R(z) in a power series; the resulting expression involves the hypergeometric function ${}_2F_1$. It is possible to verify that $\mathbb{P}\{Y = 0\} > \mathbb{P}\{Y = 1\} > \mathbb{P}\{Y = s\}$ for all integers s ≥ 2. Therefore, the most probable outcome is a scientist with no cited papers. If we restrict ourselves to scientists having at least one citation, then authors with exactly one citation have the highest probability. The Laplace transform of the distribution of Y has the form
$$L(s) = R(e^{-s}) = \frac{q}{q+(1-q)(1-e^{-s})^p}, \quad s \ge 0.$$
Its asymptotics as s → 0 is
$$1 - L(s) \sim \frac{1-q}{q}\, s^p.$$
This relation shows that the random variable Y has moments of order less than p and does not have moments of higher order. Because p < 1, the variable Y has an infinite mean. In practice, this means that some scholars have a very large number of citations, and that these citations refer to publications by a relatively small number of scholars. Of course, the data in Tables 7.1 and 7.2 are in agreement with these statements. It is important that the model is built on the assumption that all scientists have the same capabilities. Even so, we must observe great variability in the number of citations of their publications. Thus, the difference in the number of citations can be purely random and need not say anything about the real contribution of a scientist to the corresponding field of science.
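The first probabilities of Y can be computed without special functions. The sketch below is an illustration under the model assumptions (geometric number of cited papers with parameter q, Sibuya number of citations per paper with parameter p); the function names and parameter values are ours. It uses the identity R(z) = q + (1 − q) Q(z) R(z), which follows from R = P(Q) with P(w) = q/(1 − (1 − q)w), and equates the coefficients of $z^n$.

```python
def sibuya_pmf(k_max, p):
    """Return [P{X=1}, ..., P{X=k_max}] with P{X=k} = (p/k) * prod_{j<k} (1 - p/j)."""
    probs, survive = [], 1.0
    for k in range(1, k_max + 1):
        stop = p / k
        probs.append(stop * survive)
        survive *= 1.0 - stop
    return probs


def author_citation_pmf(n_max, p, q):
    """Return [P{Y=0}, ..., P{Y=n_max}] for the compound (geometric sum of Sibuya) model."""
    s = [0.0] + sibuya_pmf(n_max, p)  # s[k] = P{X = k}, with s[0] = 0
    r = [q]  # P{Y = 0} = R(0) = q
    for n in range(1, n_max + 1):
        # Coefficient of z**n in R = q + (1 - q) * Q * R.
        r.append((1.0 - q) * sum(s[k] * r[n - k] for k in range(1, n + 1)))
    return r


if __name__ == "__main__":
    p, q = 0.5, 0.3  # illustrative values, not fitted to the data
    r = author_citation_pmf(200, p, q)
    print(r[:4])
    print(sum(r))  # noticeably below 1: a lot of mass sits beyond n = 200
```

All terms in the recursion are nonnegative, so it is numerically stable; the slow growth of the partial sums reflects the heavy tail of Y.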
Of course, the proposed model is very idealistic, since it does not take into account the real differences in the capabilities of scientists, as well as in their access to the necessary tools and equipment. Taking these differences into account is likely to lead to the need to consider mixtures of the proposed distributions with different parameters p and q. However, such a complication will not make it possible to distinguish scientists with a large contribution to science from those with a smaller impact.
Surely, the arguments presented for the choice of γ = 1 are rather crude; in reality, it may happen that γ is merely close to unity. In this case the distribution tail is not heavy, but over a very large (though finite) interval it is close to a heavy one. So, qualitatively, our conclusions remain unchanged.
Based on the foregoing, we conclude that it is practically senseless to use the number of citations of a scientist's work to assess his contribution to science.

Remarks on the model with γ > 1
In this subsection, we try to justify the possibility of using models with γ greater than one. As already noted, in this model the probability $\mathbb{P}\{Y = \infty\}$ is not equal to zero. It is unlikely that this corresponds to the situation in which all scientists working in a given field of science are considered. However, a very long (ideally, endless) citation process is quite possible in the case of the most prominent scientists. For example, in the field of mathematics, the works of Professor Andrei Nikolaevich Kolmogorov continue to be cited. Over the past 15 years, they have been cited about 30,000 times, although more than 30 years have passed since the death of their author. It is highly probable that the citation process for these works will continue for a long time.
In addition, the concept of citation is somewhat arbitrary in our opinion. For example, in mathematics, some theorems or other objects bear the names of the scientists who were involved in obtaining them. Does the mention of these theorems and the corresponding names in some articles count as a citation? For example, many articles and books mention the Gaussian distribution without reference to the corresponding publication by Gauss. Is this mention a citation? It seems to us that eponymous mentions of this kind are not counted in determining the citation index. However, they certainly indicate the scientific significance of the result. It is very likely that, to account for citations of this kind, models with γ greater than 1 may be required.

Hirsch index
Recall that the definition of the Hirsch index was given in the Introduction. Hirsch states that the proposed index h is intended to rank authors of articles in the field of physics. At the same time, he notes that the index can be used in other fields of science. Since the number of citations is used in determining the index h, it seems plausible that h is associated with this number. Hirsch notes that the number of citations is $N = \kappa h^2$. He wrote: "I find empirically that κ ranges between 3 and 5". And further Hirsch wrote: "κ > 5 is very atypical value".
Below we show that the Hirsch statements presented here are doubtful. Also, the use of this index seems unreasonable.
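For concreteness, the Hirsch index and the ratio κ = N/h² can be computed from a list of per-paper citation counts as in the following sketch. The function names and the sample citation list are hypothetical and serve only as an illustration; they are not taken from the tables.

```python
def hirsch_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def kappa(citations):
    """Ratio N / h**2 of the total number of citations to the squared Hirsch index."""
    h = hirsch_index(citations)
    if h == 0:
        raise ValueError("kappa is undefined when the Hirsch index is zero")
    return sum(citations) / h ** 2


if __name__ == "__main__":
    sample = [10, 8, 5, 4, 3]  # hypothetical citation counts of five papers
    print(hirsch_index(sample), kappa(sample))
    print(hirsch_index([100]), kappa([100]))  # one highly cited paper: h = 1, large kappa
```

The second call shows how a single highly cited work produces a small h together with a large κ.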
Let us start by analyzing the data in Tables 7.1 and 7.2. Recall that column 5 gives the corresponding values of κ. Judging by these values, at least for such fields as "Mathematics" and "Biostatistics", Hirsch's conclusion about the "typical" form of proportionality between the number of citations of an author and the square of the corresponding Hirsch index seems to be incorrect. However, was Hirsch right in the field of "Physics"?

Data in "Physics"
Now we give the data for the field "Physics", arranging them into a table in the same way as for Tables 7.1 and 7.2 (see Table 8.1, Citations "Physics"). Again, Table 8.1 has only one κ ≤ 5, namely κ = 4.88. However, there are 6 values κ ∈ (5, 6). The κ values for "Physics" look smaller than for "Biostatistics" and significantly smaller than for "Mathematics". The value of the Hirsch index for physics has much less variability than for biostatistics and mathematics. The differences in citation numbers are much greater for mathematics than for physics.
So, we see that Hirsch's understanding of the situation is closer to reality in physics than in biostatistics and, especially, mathematics.

Data comparison
We continue the analysis of the data in Tables 7.1, 7.2 and 8.1.
The average value of the Hirsch index in Table 7.1 is 99.3 with a standard deviation of 66.45. The same indicators for Table 7.2 are 153.8 and 47.97, and for Table 8.1 they are 198.2 and 21.73. We see that the standard deviation of the Hirsch index in mathematics is three times greater than in physics. On the contrary, the average value of the index is largest in physics and smallest in mathematics. This shows that even if the Hirsch index is useful in the field of Physics, its usefulness in the field of Mathematics is doubtful. Probably, the same is true for Biostatistics.
Authors with a higher Hirsch index are often inferior to others in the number of citations of their most popular works. For example, in Table 7.1, author 1, who has the highest Hirsch index, is inferior to authors 2, 4, 5, 6 and 7 in the number of citations of the most popular work. In this case, author 1 wrote his most cited work with co-authors, while author 2 wrote his without co-authors.
It is clear that the Hirsch index does not exceed the number of cited publications of the author, which has a geometric distribution with an exponentially decreasing tail. Thus, the distribution of the Hirsch index has a light tail. Since the number of citations has a heavy tail, it is more variable than the Hirsch index. However, these two indicators are stochastically strongly related. Indeed, for the data in Table 7.1, the sample correlation coefficient between these indicators is ρ1 = 0.94. On the other hand, the correlation coefficient between the Hirsch index and the number of citations of the most popular work is ρ2 = −0.23. This coefficient indicates a weak relationship between the indicators, and it is negative. In other words, a large Hirsch index is most likely not found among authors with highly cited individual articles. For Table 7.2, the values of the correlation coefficients are ρ1 = 0.702 and ρ2 = 0, and for Table 8 …
The increase in the Hirsch index with a decrease in the number of citations of the most popular work may result from the division of the work into a series of publications. However, when assessing the quality of a scientist's contribution, one should take into account that the publication of a series of articles instead of one may be caused not by a desire to increase the number of publications, but, for example, by a gradual insight into the essence of the problem under consideration. Such insight often requires a very long time, i.e. the publication of a series of articles is justified. It should be noted that the publication of a series of articles naturally leads to an increase in the number of self-citations. This increase cannot be considered a flaw of the author and does not mean an attempt to artificially increase the number of citations. At the same time, the presence of a series of publications (which increases the Hirsch index) cannot be considered preferable to one highly cited work.
The higher values of the Hirsch index in Physics compared to Mathematics can be explained by the use of expensive equipment in modern experimental physics and of the results obtained with it in theoretical physics. Often this equipment is used by one laboratory or scientific group and then transferred to another. After some time, the equipment again becomes available to the first group. Thus, new experimental facts arrive intermittently, and during the breaks they are processed and published. A theoretical analysis of the observed facts also takes place. Then new information related to new experiments arrives. Therefore, the very flow of information (both experimental and theoretical) contributes to the publication of not a single article but a series of articles. This circumstance leads to an increase in the Hirsch index with a relative decrease in the number of citations of popular works.
A similar situation is absent in pure mathematics. Therefore, there are far fewer reasons for a series of articles to appear there. Separate works appear, which often cover a substantial part of the problem under consideration. They cause a stream of citations of this particular work rather than of a series of works. Thus, the Hirsch index becomes smaller than it would be if a series of articles were published instead of this one, but the most popular work receives more citations than each individual work in the series would.
So, the use of the Hirsch index has some basis in the field of Physics, but it is not related to what is happening in Mathematics.
For some areas of applied mathematics, a situation may be observed that is intermediate between what is happening in physics and in pure mathematics.
However, it is not clear to us why the Hirsch index should not be replaced with two indicators. The first of these could be the number of all citations, and the second the number of citations of the most popular work. The Hirsch index is stochastically quite closely linked to the number of all citations, so it and this number are "interchangeable." However, after a scientist stops working in a given field of science, the number of his publications does not increase and, therefore, the Hirsch index remains bounded, while the number of citations can continue to grow without limit. This is exactly what happens with the works of the most outstanding scientists of the past.

Distribution of the Hirsch index
In this section, we obtain the probability distribution of the Hirsch index.
We introduce some notation. It is clear that the Hirsch index is a random variable. Let us denote it by H, and its values by h. Our aim here is to determine the probabilities that H = h, i.e. $\mathbb{P}\{H = h\}$. In order for the event H = h to occur, it is necessary and sufficient that: a) no fewer than h works were published; b) h of the published works are cited at least h times each, and the rest are cited fewer than h times.
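Suppose that l works are published, and l ≥ h. The probability of this event is $q(1-q)^l$. Recall that the probability that a published work will be quoted k times equals $\frac{p}{k}\prod_{j=1}^{k-1}(1 - p/j)$. Therefore, the probability that a published work will be cited at least h times equals
$$a_h = \sum_{k=h}^{\infty}\frac{p}{k}\prod_{j=1}^{k-1}\left(1-\frac{p}{j}\right) = \prod_{j=1}^{h-1}\left(1-\frac{p}{j}\right) = \frac{\Gamma(h-p)}{\Gamma(1-p)\,\Gamma(h)},$$
where Γ is the Euler gamma function and the notation $a_h$ is used for brevity. The probability that a published work will be cited fewer than h times is
$$1 - a_h = 1 - \frac{\Gamma(h-p)}{\Gamma(1-p)\,\Gamma(h)}.$$
Thus, the probability that l papers are published and the Hirsch index H has taken the value h is
$$q(1-q)^l \binom{l}{h} a_h^{\,h}\,(1-a_h)^{l-h}.$$
Now we see that
$$\mathbb{P}\{H = h\} = \sum_{l=h}^{\infty} q(1-q)^l \binom{l}{h} a_h^{\,h}\,(1-a_h)^{l-h} = \frac{q\,\bigl((1-q)a_h\bigr)^h}{\bigl(q+(1-q)a_h\bigr)^{h+1}}.$$
So, the random variable H has the following distribution:
$$\mathbb{P}\{H = h\} = (1-\nu)\,\nu^h, \qquad \nu = \frac{(1-q)\,a_h}{q+(1-q)\,a_h} = \frac{(1-q)\,\Gamma(h-p)}{q\,\Gamma(1-p)\,\Gamma(h)+(1-q)\,\Gamma(h-p)}.$$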
Note that this distribution is not a geometric one because the value of ν depends on h.
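The distribution of H can be evaluated numerically. The sketch below assumes the closed form $\mathbb{P}\{H = h\} = (1-\nu)\nu^h$ with $\nu = (1-q)a_h/(q+(1-q)a_h)$ and $a_h = \Gamma(h-p)/(\Gamma(1-p)\Gamma(h))$, which is what the summation over l ≥ h gives under conditions a) and b), and cross-checks it against the direct sum. The function names and parameter values are illustrative.

```python
import math


def at_least_h(h, p):
    """a_h = P{a paper is cited at least h times} for h >= 1 (Sibuya tail)."""
    return math.exp(math.lgamma(h - p) - math.lgamma(1.0 - p) - math.lgamma(h))


def nu(h, p, q):
    """nu = (1 - q) a_h / (q + (1 - q) a_h); it depends on h through a_h."""
    a = at_least_h(h, p)
    return (1.0 - q) * a / (q + (1.0 - q) * a)


def hirsch_pmf(h, p, q):
    """Closed form (1 - nu) * nu**h for P{H = h}, h >= 1."""
    v = nu(h, p, q)
    return (1.0 - v) * v ** h


def direct_sum(h, p, q, extra_terms=400):
    """Truncated sum over l >= h of q (1-q)^l C(l, h) a_h^h (1 - a_h)^(l - h)."""
    a = at_least_h(h, p)
    return sum(
        q * (1.0 - q) ** l * math.comb(l, h) * a ** h * (1.0 - a) ** (l - h)
        for l in range(h, h + extra_terms + 1)
    )


if __name__ == "__main__":
    p, q = 0.5, 0.3  # illustrative values
    for h in (1, 3, 10):
        print(h, hirsch_pmf(h, p, q), direct_sum(h, p, q))
```

Since ν itself decreases with h, the probabilities fall off faster than any geometric sequence.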
Next, we are interested in estimating the tail of the distribution of H. To do this, we estimate the asymptotic behavior of ν. The application of the Stirling formula allows one to easily obtain that
$$\nu \sim \frac{1-q}{q\,\Gamma(1-p)}\, h^{-p}, \quad h \to \infty.$$
This formula immediately leads us to an asymptotic expression for the logarithm of the probability $\mathbb{P}\{H = h\}$ for h → ∞. Namely, $\log \mathbb{P}\{H = h\} \sim -p \cdot h \cdot \log h$, h → ∞.
It follows that the probability of the event {H = h} decreases faster than an exponential function as h → ∞. Of course, the tail of the distribution of H also decreases faster than an exponential function. Therefore, this distribution has moments of all orders. Note that the distribution of the number of citations of articles by this author has an infinite mean value. So, if an author has a fairly large number of citations, then the ratio of the number of citations to the square of the Hirsch index can be arbitrarily large. This fact contradicts Hirsch's claim that κ is bounded.