Anomalous diffusion in the citation time series of scientific publications

We analyze the citation time series of manuscripts in three different fields of science: physics, social science and technology. The time series of the yearly number of citations, namely the citation trajectories, diffuse anomalously: their variance scales with time as $\propto t^{2H}$, where $H\neq 1/2$. We provide a detailed analysis of the various factors that lead to the anomalous behavior: non-stationarity, long-ranged correlations and a fat-tailed increment distribution. The papers exhibit a high degree of heterogeneity across the various fields, as the statistics of the most highly cited papers are fundamentally different from those of the less cited ones. The citation data are shown to be highly correlated and non-stationary, as all the papers, except the small percentage with a high number of citations, die out in time.


I. INTRODUCTION
Studying the structure of underlying patterns in different fields of science gives insight into the evolution of the system over time, and also provides the ability to make general assumptions about its future [1][2][3]. Citation networks not only reveal the hidden patterns and structure in different fields of science, but also disclose the human behavior which forms them [4][5][6][7]. Following a growing interest, the number of publications on citation analysis has increased significantly in the past few decades [8][9][10]. The analysis so far suggests a nearly universal behavior in science across very different disciplines [11][12][13][14][15]. One prominent observation reveals the presence of a small percentage of papers in each discipline that dominate the field, with the number of their citations growing increasingly fast as they gain more attention in time, whereas there is a large number of papers whose influence quickly dies out [16][17][18][19]. This heterogeneity, as we will also show, is related to the anomalous growth of the variance of the process. In this paper, we study the yearly citation time series in three different fields: physics, social science and technology. The records have been collected from the Web of Science (WoS) database [20], and include publications between the years 1974 and 2012. Within this dataset, we chose papers which were published during the five years 1974 to 1978, and tracked their citation time series up until 34 years after publication. We use two different methods to calculate the temporal behavior of the average yearly number of citations, as well as their variance, and we study the non-stationarity of the time series, its correlations and the effect of large fluctuations.
In the Time-Average (TA) approach, we first consider the quantity $C_i(t')$, defined as the number of citations that an individual paper $i$ gains in a single year, $t'$ years after its original publication. Then, the average citation number for this paper after $t$ years is $\overline{C_i(t)} = \frac{1}{t}\sum_{t'=1}^{t} C_i(t')$. In the Ensemble-Average (EA) method, we look at the number of citations of each of the manuscripts per year, without considering their history. Thus, the ensemble mean at year $t$ is $\langle C(t)\rangle = \frac{1}{N}\sum_{i=1}^{N} C_i(t)$. Here, $N$ is the total number of analysed papers. It is not uncommon to combine the two methods and obtain the Ensemble-Averaged Time Average (EATA), $\langle \overline{C_i(t)}\rangle = \frac{1}{N}\sum_{i=1}^{N} \overline{C_i(t)}$. According to the law of large numbers, if $N$ is large and the system is in statistical equilibrium, the two approaches yield equivalent results: the probability distribution of the TAs becomes a δ-function around its mean, and $\overline{C_i(t)} = \langle C(t)\rangle$. When this result is violated, however, the system is said to exhibit weak ergodicity breaking [21][22][23]. In such a case it is important to discuss the difference between the two approaches, as mean values in the system depend on the method of measurement. Note that weak ergodicity breaking is more commonly discussed in the context of the mean squared displacement versus the time-averaged mean squared displacement (see definitions below), which describes the time evolution of the increment fluctuations in the data, see e.g., [24][25][26][27]. Using both approaches for calculating the mean statistical properties of the system, and using data from three different disciplines, we found that the citation trajectories diffuse anomalously and are non-ergodic. Therefore, we investigate the anomalous diffusion of citations through a detailed analysis of the basic factors which lead to such anomalous behavior.
First we study the correlation of citations and quantify it by the so-called Joseph exponent [28,29], which is derived from the time average of the Mean Squared Displacement (MSD). The second and third factors which lead to the anomalous growth of the citation time series are non-stationarity, quantified by the Moses effect [29][30][31], and the Noah effect [28][29][30][31] (stemming from a fat-tailed distribution), which is quantified using the Latent exponent.
The structure of the paper is as follows: In sub-Sec. II A, we study the yearly citation distributions and compare them for three different disciplines. In sub-Sec. II B, we investigate the mean and variance of the yearly citation number, comparing time and ensemble averages, for all the papers in the field, as well as for separate classes of papers defined by their total number of citations after 34 years. We also compare these results between different research fields. In sub-Sec. II C, we study the time-averaged MSD, and obtain from it the scaling properties of the autocorrelations of the citation number between different years. In sub-Sec. II D, we quantify the non-stationarity of the citation time series. In sub-Sec. II E, we study the effect of the fat tails of the yearly citation distribution on the growth of the MSD, and how they change over time. In sub-Sec. II F, we obtain the Hurst exponent from the other scaling exponents. Finally, in sub-Sec. II G, we compare the non-ergodic behavior in the three fields. The discussion is found in Sec. III.
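The three averages defined in the introduction (TA, EA and EATA) can be illustrated with a minimal sketch. It assumes that the yearly citation counts are stored in a NumPy array with one row per paper and one column per year after publication (column 0 holding $t = 1$). The function names and the toy numbers are hypothetical and are not taken from the analyzed WoS records.

```python
import numpy as np

def time_average(citations):
    """Running time average of each paper's yearly citations.

    citations: array of shape (N papers, T years); column 0 is t = 1.
    Returns an array of the same shape whose entry [i, t-1] is the
    mean of paper i's yearly citations over its first t years.
    """
    c = np.asarray(citations, dtype=float)
    t = np.arange(1, c.shape[1] + 1)
    return np.cumsum(c, axis=1) / t

def ensemble_average(citations):
    """Mean yearly citation count over all papers, for each year t."""
    return np.asarray(citations, dtype=float).mean(axis=0)

def eata(citations):
    """Ensemble average of the time averages (EATA)."""
    return time_average(citations).mean(axis=0)

# Toy data (hypothetical): one paper that dies out, one that keeps growing.
toy = np.array([[4, 2, 0, 0],
                [1, 3, 5, 7]])

print(ensemble_average(toy).tolist())  # [2.5, 2.5, 2.5, 3.5]
print(eata(toy).tolist())              # [2.5, 2.5, 2.5, 2.75]
```

In this toy example the EA and the EATA already differ in the last year, because the time average keeps a memory of the earlier years while the ensemble average does not.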

A. Citation distribution
In this section, we study the normalized distributions of the yearly citation numbers $C_i(t)$ of papers published during the five years from 1974 to 1978. Table I presents the scientific disciplines and the number of papers whose citation distribution has been analyzed in this paper. The length of each series is 34 years, where $t = 1$ corresponds to the year of publication. The normalized distributions of $C_i(t)$ in a single year, 20 years after publication, are depicted in figure 1 for the three disciplines. Figure 1 shows a power-law decay with a slope of around −3.2 for all three distributions, suggesting a qualitatively similar distribution and a universal behavior across these fields. The power-law nature of the yearly citation distribution for the ISI academic papers, and for publications in the journal Physical Review, is known [14][32][33][34]. However, a few other studies reported log-normal distributions as well, and detected universal scaling properties of such distributions in some more specialized categories of science than the ones we are analysing, e.g., nuclear physics and engineering [11,35]. Note that probability distributions with power-law tails $\propto 1/C^{\gamma}$ at large $C$ can be divided into two categories with significantly different statistical behavior. When $0 \leq \gamma \leq 2$, the mean square $\langle C^2 \rangle$ diverges. In this case the dynamics is called "scale-free", and no matter at which time-scale we observe the series $C(t)$, its shape will be governed by only one or a few large fluctuations. When $\gamma > 2$, the first and second moments of $C$ are finite. In our case $\gamma = 3.2 > 2$, so, as was also concluded by Golosovsky [17], the yearly citation distribution does not appear to be scale-free. However, in Sec. II E we will revisit this result, adding a second observation which may point to the opposite conclusion.
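As a complement to reading the slope off a log-log histogram, the tail exponent $\gamma$ of a density $\propto 1/C^{\gamma}$ can also be estimated directly from the samples. The sketch below uses the continuous maximum-likelihood (Hill-type) estimator, which is a swapped-in technique and not the procedure used for figure 1. The function name, the cutoff `c_min` and the synthetic Pareto samples are assumptions made for illustration only; real citation counts are integers, for which a discrete estimator would be more appropriate.

```python
import numpy as np

def tail_exponent_hill(samples, c_min):
    """Estimate gamma for a density tail ~ 1/C**gamma above c_min.

    Continuous maximum-likelihood (Hill-type) estimator:
    gamma = 1 + n / sum(log(C_k / c_min)) over the n samples with C_k >= c_min.
    """
    x = np.asarray(samples, dtype=float)
    tail = x[x >= c_min]
    return 1.0 + tail.size / np.log(tail / c_min).sum()

rng = np.random.default_rng(0)
# Synthetic Pareto samples with density ~ 1/C**3.2 for C >= 1.
# These are NOT citation data; they only exercise the estimator.
synthetic = 1.0 + rng.pareto(2.2, size=200_000)
gamma_hat = tail_exponent_hill(synthetic, c_min=1.0)
print(round(gamma_hat, 2))
```

The estimate depends on the choice of `c_min`, so in practice one would check that it is stable over a range of cutoffs.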

B. The Mean and Variance
As discussed in the introduction, we study the EA of the citations at each year after publication, up until 34 years, as well as the TAs. Figure 2 shows the EA of the citations for different groups of papers in the field of physics, as an example, where the groups are distinguished by their total number of citations. The largest category, that of papers with fewer than 100 citations, includes 496085 papers, which account for 97.91% of all the analyzed papers in this field. As shown in this figure, the EA of the lowest categories decreases to zero at long times, whereas in the middle categories, such as 400-500, it seems to saturate at a nonzero value, at least within our measurement period. In the categories of papers with more than $N_c = 900$ citations which, when combined, include only 0.026% of the whole ensemble of analyzed works, the EA continues to grow indefinitely (within our time window), as these famous, highly cited papers obtain increasingly more attention with time. Note that the rate of growth also changes approximately 20 years after publication. The threshold number of citations beyond which the EA grows continuously differs between fields of science: $N_c = 900$ in physics, $N_c = 400$ in social science and $N_c = 150$ in technology. These variations show different levels of competition in different fields, and imply for example that "getting more popular in the field of physics is harder than the other two fields". In a previous study, Golosovsky [17] reported that the citation time series of low-cited papers published in the same year reach saturation after 10-15 years, while those of the highly cited papers grow indefinitely. This phenomenon, in which "the rich get richer and the poor get poorer", is known in scientific publications [36], and is explained as an outcome of the growth rule in complex networks called preferential attachment [37,38].
Figure 3-(a) shows, on a logarithmic scale, the ensemble average of the citations for all of the analyzed papers in physics (in the other fields of research, our analysis showed qualitatively similar results), as well as for papers with fewer than 900 citations. For the low cited papers and for the total publications, we observe a power-law decay with the exponent ≈ −0.61. The similarity between the two slopes demonstrates that the low-cited papers dominate the behavior of the full ensemble because of the sheer number of papers in this group. In figure 3-(b), the EA is shown for the highly cited physics papers (with more than 900 citations), which exhibits a power-law growth with the exponent ≈ 0.4. In figure 4, we present the EATA and compare it between the three research fields. As mentioned above, the threshold value $N_c$ changes from field to field, which changes the percentage of what we refer to as "highly cited papers". When comparing EAs and EATAs in figure 5, where we plot the ratio $\mathrm{EA}/\mathrm{EATA}$ over time, we can see that in all cases this ratio saturates at some fixed value. This is no surprise, since if the EA is $\propto t^{\alpha}$, where $\alpha$ is a constant exponent, then the TA $\sim \frac{1}{t}\int_0^t \mathrm{EA}(t')\,dt'$ has to have the same exponent. But since exact values are also important, and not only the qualitative growth of the power laws, one should note that the constant limit to which this ratio converges in this case is $\alpha + 1$. In figure 5, the ratio saturates around ≈ 0.6 for physics and technology, and around ≈ 0.7 for social science. These values differ from the expected $\alpha + 1$, since the EA and the EATA do not follow a single power law (a straight line on the log-log plot) over the whole period of study, and therefore their corresponding exponents are not equal. Note that it is not only the EAs of the highly cited papers which grow in time; their variance, $\mathrm{var} = \langle C^2(t)\rangle - \langle C(t)\rangle^2$, also grows. Figure 6-(a) shows the change of the variance over time for the citation time series of all the physics papers.
The variance starts to grow ≈ 25 years after publication. This behavior results from the dominance of the highly cited publications in the statistics of the total citations, and can be understood by analysing the citations of low and highly cited papers separately. For low cited publications (figure 6-(b)), the variance decreases in time and saturates around 1. However, in figure 6-(c), for highly cited papers with more than 900 citations, the variance grows indefinitely, and the growth even accelerates after 25 years.
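The statement that the ratio between the EA and its time average converges to $\alpha + 1$ for a pure power law $\mathrm{EA} \propto t^{\alpha}$ can be checked numerically. The sketch below is a toy calculation under that pure power-law assumption, with a discrete yearly sum in place of the integral; the function name and the chosen window lengths are hypothetical.

```python
import numpy as np

def ea_over_ta(alpha, t_max):
    """Ratio EA(t) / TA(t) for a pure power law EA(t) = t**alpha, t = 1..t_max.

    TA(t) is the running mean of EA over the first t years.
    """
    t = np.arange(1, t_max + 1, dtype=float)
    ea = t ** alpha
    ta = np.cumsum(ea) / t
    return ea / ta

# Long window: the ratio approaches alpha + 1 (here 1.4).
ratio_long = ea_over_ta(0.4, 100_000)[-1]

# Short window of 34 years with a decaying power law: the discrete
# ratio has not yet reached alpha + 1 = 0.39 and is still above it.
ratio_short = ea_over_ta(-0.61, 34)[-1]

print(round(ratio_long, 3), round(ratio_short, 2))
```

Even for an ideal power law, the ratio in a 34-year window can therefore differ visibly from $\alpha + 1$, in addition to the deviations from a single power law discussed above.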

C. Time averaged MSD and the correlation
Let $Y_i(t) = \sum_{n=0}^{t} C_i(n)$ be the cumulative sum of the yearly citation number, for some paper $i$, until the year $t$ after its publication. Here, we study the fluctuations of the citation trajectories using the Time Averaged Mean Square Displacement (TA-MSD). For a single trajectory (the citation history of one particular paper $i$), the TA-MSD is given by [39]
$$\overline{\delta_i^2(\Delta)} = \frac{1}{T-\Delta}\sum_{t=1}^{T-\Delta}\left[Y_i(t+\Delta) - Y_i(t)\right]^2. \qquad (1)$$
The above moving average sums the squares of the number of citations added to each trajectory over intervals of duration $\Delta$, up to time $T - \Delta$, and divides by $T - \Delta$. Here $t = T$ is the maximal measured time in our data, which is $T = 34$ years. The ensemble-averaged TA-MSD is
$$\langle \overline{\delta^2(\Delta)} \rangle = \frac{1}{N}\sum_{i=1}^{N} \overline{\delta_i^2(\Delta)}.$$
In the data analysis, we take the maximum lag time to be $\Delta = \lfloor T/3 \rfloor$. Recently, the TA-MSD has been shown to scale as [30,31]
$$\langle \overline{\delta^2(\Delta)} \rangle \propto \Delta^{2J}, \qquad (2)$$
where $J \in [0, 1]$ is called the Joseph exponent, which is associated with the autocorrelations in the time series. For a random process without long-ranged autocorrelations, $J = 1/2$. If a process is long-ranged and positively correlated, $1/2 < J \leq 1$, and for an anti-correlated process $0 < J < 1/2$ [28,29,40]. The Hurst exponent $H$ quantifies the temporal growth of the MSD, $\langle Y^2 \rangle \propto t^{2H}$ at long $t$ [28]. In standard Gaussian processes $H = 1/2$, whereas $H > 1/2$ indicates that the process is super-diffusive, and $H < 1/2$ that it is sub-diffusive. Figure 7 displays the scaling of $\langle \overline{\delta^2(\Delta)} \rangle$ with respect to $\Delta$ for all the publications in the three research fields, as well as for the highly cited papers. According to equation 2, the slopes in figure 7 equal $2J$. The Joseph exponent of all the citation trajectories is higher than 1/2, representing a high degree of correlation between the numbers of citations the papers get in subsequent years, across all the fields.
This was to be expected, since if a paper is popular and gets a high number of citations after its publication, its citation count keeps growing due to its high exposure, while less popular papers, with a low number of citations in the first years, will often stay in the same state in the years that follow. This may be somewhat encouraging for some researchers and discouraging for others: on one hand our "citation culture" is not completely arbitrary, but on the other hand, if your work does not become popular quickly, your chances of changing this situation later are not so great. This trend holds not only for high and low cited papers but also for papers with an average number of citations. Note that the Joseph exponent for highly cited papers is a bit larger than that of the average and the lower cited ones.
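The estimation of the Joseph exponent from the TA-MSD can be sketched as follows. This is a minimal illustration, assuming that the cumulative citations $Y_i(t)$ are stored as a NumPy array with one row per paper, that the lag runs from 1 to $\lfloor T/3 \rfloor$, and that $2J$ is obtained from a straight-line fit on the log-log plot. The function names and the two synthetic test cases (an uncorrelated random walk and a purely ballistic trajectory) are hypothetical and are not citation data.

```python
import numpy as np

def ta_msd(Y, lag):
    """Time-averaged MSD of each trajectory at a given lag.

    Y: array (N, T) of cumulative values. The mean runs over the
    T - lag available displacements Y(t + lag) - Y(t).
    """
    Y = np.asarray(Y, dtype=float)
    disp = Y[:, lag:] - Y[:, :-lag]
    return (disp ** 2).mean(axis=1)

def joseph_exponent(Y, max_lag=None):
    """Fit the ensemble-averaged TA-MSD ~ lag**(2J) and return J."""
    Y = np.asarray(Y, dtype=float)
    if max_lag is None:
        max_lag = Y.shape[1] // 3  # maximum lag taken as floor(T / 3)
    lags = np.arange(1, max_lag + 1)
    ea_ta_msd = np.array([ta_msd(Y, lag).mean() for lag in lags])
    slope, _ = np.polyfit(np.log(lags), np.log(ea_ta_msd), 1)
    return slope / 2.0

rng = np.random.default_rng(1)
# Uncorrelated +/-1 steps: expected J close to 1/2.
walk = np.cumsum(rng.choice([-1.0, 1.0], size=(2000, 34)), axis=1)
# Perfectly persistent growth Y = a * t: J = 1.
ballistic = np.outer(np.arange(1.0, 6.0), np.arange(1.0, 35.0))

print(round(joseph_exponent(walk), 2), round(joseph_exponent(ballistic), 2))
```

The two test cases bracket the range discussed in the text: values of $J$ above 1/2, as found for the citation trajectories, lie between the uncorrelated and the fully persistent limits.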

D. Characterizing the nonstationarity of citation trajectories, using the Moses exponent
In general, in a stochastic process $x(t)$, if the probability distribution of the increments $x(t+\tau) - x(t)$ is independent of $t$, the increments are considered stationary. In Refs. [29][30][31][41], it was shown that the level of non-stationarity can be quantified by a single parameter, which is called the Moses exponent. This parameter can be measured directly from the data. Recall that we defined $Y_i(t)$ as the cumulative sum of the yearly citations of a single paper after $t$ years. For a general time series, the Moses exponent $M$ is defined via the temporal scaling of the EA of the sum of the absolute increments of the process [29]. In our case, since the increments of $Y_i(t)$ are always positive, $|C_i(n)| = C_i(n)$, we obtain the Moses exponent from
$$\langle Y_i(t) \rangle = t\,\langle \overline{C_i(t)} \rangle \propto t^{M + 1/2},$$
where, as mentioned above, angled brackets mean ensemble average and the overline means time average. For normal diffusive processes with stationary increments, $M = 1/2$. For a non-stationary process which dies out because its increments become smaller in time, $M$ is smaller than 1/2. When we find $M > 1/2$, it indicates that the absolute increments (yearly citations) grow in time. Figure 8 shows the scaling of $\langle Y_t \rangle / t^{1/2}$ versus $t$ for physics, social science and technology. The figure shows that in the first few years the mean of $Y_t$ does not grow like a power law, as indicated by the curvature of the log-log plot, but from around $t = 10$ the growth is a power law, with exponents which are smaller than 1/2 in all the fields. This demonstrates that, overall, the citation time series die out in time.
[Figure caption, partially recovered: The values of the Moses exponent for the highly cited papers in all three fields are high, showing the non-aging behavior of these papers in time. However, the low cited papers, with a Moses exponent close to zero, age and die out, in the sense that the rate at which they get new citations only decreases with time.]
To investigate the non-stationarity of this process better, we repeat the analysis for the two groups of low cited and highly cited publications. In figure 9-(a), the scaling of $\langle Y_t \rangle / t^{1/2}$ for the highly cited papers is shown. $M$ is higher than 1/2 for all three curves, which means that the yearly citations of the most popular papers increase over time, so that these papers gain popularity regardless of the correlations (recall that the Moses and the Joseph effects are two separate effects [29]). However, as mentioned above, the statistics of the highly cited papers have a very small effect on the measured Moses exponent of the whole ensemble, since they are few in number. For the low cited papers in figure 9-(b), the exponent $M$ obtained from $\langle Y_t \rangle / t^{1/2}$ is close to 0 in all three fields, since such a paper stops getting cited and its citation time series reaches a plateau with no new citations added.
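The measurement of the Moses exponent described in this subsection can be sketched in a few lines. The sketch assumes the scaling $\langle Y_t \rangle \propto t^{M+1/2}$, which is consistent with plotting $\langle Y_t \rangle / t^{1/2}$ versus $t$, and it fits only the years $t \geq 10$, where the text reports power-law growth. The function name, the `t_min` parameter and the two synthetic inputs are hypothetical.

```python
import numpy as np

def moses_exponent(citations, t_min=10):
    """Estimate M from <Y_t> ~ t**(M + 1/2), fitted for t >= t_min.

    citations: array (N, T) of yearly citation counts, column 0 is t = 1.
    The ensemble mean of Y_t must be positive in the fitted range.
    """
    c = np.asarray(citations, dtype=float)
    y_mean = np.cumsum(np.abs(c), axis=1).mean(axis=0)  # <Y_t>
    t = np.arange(1, c.shape[1] + 1)
    mask = t >= t_min
    slope, _ = np.polyfit(np.log(t[mask]), np.log(y_mean[mask]), 1)
    return slope - 0.5

# Stationary increments (same citation count every year): M = 1/2.
stationary = np.ones((3, 34))
# Yearly citations that grow linearly in time: M well above 1/2.
growing = np.tile(np.arange(1.0, 35.0), (3, 1))

print(round(moses_exponent(stationary), 2), round(moses_exponent(growing), 2))
```

A trajectory whose yearly citations decay in time would give a fitted slope below 1, and therefore $M < 1/2$, which is the aging case discussed above.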

E. The significance of large, rare fluctuations; the Noah effect in citation trajectories
In Subsec. II A, we found that the yearly-citation distribution has a power-law tail which roughly falls off as $\propto 1/C^{3.2}(t)$. Naively, this seems to suggest that the citation time series is not scale-free, namely that its mean and variance are finite. This is also in agreement with the observation reported in Ref. [17]. However, we would now like to introduce an additional observation, which might alter this conclusion. In time-series analysis, when we do not have a large ensemble of data from which to obtain clear statistics, the influence of fat tails in the increment distribution of the process on the anomalous diffusion can be measured directly from a small number of sufficiently long paths, via the so-called "Noah effect" [28][29][30]. Having obtained the Moses exponent in Subsec. II D, we now obtain the Latent exponent $L$ [28] via
$$\langle Z_t \rangle = \left\langle \sum_{n=0}^{t} C_i^2(n) \right\rangle \propto t^{2L + 2M - 1}.$$
Note that here we use a sampled ensemble average, which is guaranteed to have a finite value at finite times, as opposed to the theoretical value, which diverges if the increment distribution has scale-free fat tails. One can also use the sampled median here, instead of the mean. When $M = 1/2$, namely when we do not observe aging effects in the time series, and $L = 1/2$, we have $Z_t \sim t$, which is similar to a standard Gaussian process with a finite increment variance. On the other hand, $L > 1/2$ can only occur if the increment distribution has at least a regime with a power-law shape that falls off as $1/C^{\gamma}$, with $0 < \gamma < 2$ [29,31]. Namely, in this regime the distribution does not have a typical value. This power-law shape of the tails of the distribution might have a time-dependent cutoff, which is pushed towards ∞ as time increases, but it needs to be visible in a certain regime of $C$ also at finite, but long, times.
Note that, at least at long times, $L$ cannot be smaller than 1/2, since this would mean that the average of the squared increments expands more slowly in time than the square of their mean absolute value. All this explanation holds also when $M \neq 1/2$. In figure 10, we show the scaling of $Z_t$ for the publications of two fields of science (physics and social science) and for highly cited papers. Figure 10-(a) shows the presence of a strong Noah effect when we observe the full ensemble of trajectories together. This results from the highest cited papers, which obtain far more citations than the others. This Noah effect suggests that the citation data is indeed scale-free, but we could not detect that by observing the shape of the tails of the distribution, since the highly cited papers are too few in number. If we look again at figure 1 in Sec. II A, we see that beyond the region where we find a linear behavior (in the log-log plot) with a slope > 3, there is another regime where the slope is 1. The latter, however, is only an effect of the logarithmically sized bins that we used to plot the distribution, and it indicates that in every bin in the far tails we found only one paper, since the papers are very sparsely distributed. All it means is that we probably do not have enough data to sample the distribution well enough in the farthest tails, but we suspect that the shape of the tails there indeed corresponds to a scale-free distribution, whose trace is detected by the Noah effect. Figure 10-(b) shows the linear scaling of $Z_t$ for highly cited papers. The Latent exponent $L$ for these publications is close to 0.5. Here, the explanation is clear: since we only look at the distribution of a small percentage of all papers, which are located in the same region of the histogram, their own distribution is thin (this sample of papers is more homogeneous than the total ensemble).
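The Latent exponent can be estimated from the data in the same way as the Moses exponent. The sketch below assumes the scaling relations $\langle Y_t \rangle \propto t^{M+1/2}$ and $\langle Z_t \rangle \propto t^{2L+2M-1}$, which reproduce the limit quoted in the text ($Z_t \sim t$ when $M = L = 1/2$); these forms, the function names and the fit window `t_min` are assumptions made for illustration, not a statement of the exact fitting procedure behind figure 10.

```python
import numpy as np

def fit_slope(t, y, t_min):
    """Least-squares slope of log(y) versus log(t) for t >= t_min."""
    mask = t >= t_min
    return np.polyfit(np.log(t[mask]), np.log(y[mask]), 1)[0]

def moses_and_latent(citations, t_min=10):
    """Return (M, L) from the sampled ensemble means of Y_t and Z_t.

    Y_t: cumulative sum of absolute yearly citations, <Y_t> ~ t**(M + 1/2).
    Z_t: cumulative sum of squared yearly citations, <Z_t> ~ t**(2L + 2M - 1).
    """
    c = np.asarray(citations, dtype=float)
    t = np.arange(1, c.shape[1] + 1)
    y_mean = np.cumsum(np.abs(c), axis=1).mean(axis=0)
    z_mean = np.cumsum(c ** 2, axis=1).mean(axis=0)
    M = fit_slope(t, y_mean, t_min) - 0.5
    L = (fit_slope(t, z_mean, t_min) + 1.0) / 2.0 - M
    return M, L

# Thin-tailed, stationary toy ensemble: each paper has its own constant
# yearly citation count, so both M and L equal 1/2.
amplitudes = np.array([[1.0], [2.0], [5.0]])
toy = amplitudes * np.ones((3, 34))
M_toy, L_toy = moses_and_latent(toy)
print(round(M_toy, 2), round(L_toy, 2))
```

As noted in the text, the sampled mean could be replaced by the sampled median; in this sketch that would amount to replacing `.mean(axis=0)` by `np.median(..., axis=0)`.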

F. The Hurst exponent
So far, we have shown that the yearly citation time series exhibit correlation, non-stationarity and possibly large fluctuations stemming from a fat-tailed distribution. In Refs. [29][30][31][42], it was shown that using the three exponents $M$, $L$ and $J$, we can obtain the Hurst exponent via a simple summation relation
$$H = J + L + M - 1. \qquad (6)$$
Figure 11-(a) shows the scaling of σ with time for the full ensembles of physics and social science papers, with their corresponding Hurst exponents. Figure 11-(b) shows the scaling of σ with time, obtained only from the highly cited papers. In all the categories the citation data is clearly super-diffusive, and for the highly cited papers even super-ballistic. In table II, the four exponents are presented for all the publications in the fields, as well as for the highly cited papers. The relation in equation 6 holds nicely in all the different categories. This indicates that one may use this summation relation to obtain any of the four exponents $M$, $L$, $J$ and $H$ from the other three, since the correlations, the non-stationarity and the large fluctuations in the time series are not decoupled from each other.
[Figure caption, partially recovered: ... the three fields. The result for the field of technology is not shown here, since we did not find a linear scaling behavior for it. The linear fit is done in the region with the best linear behavior; there is a cross-over after 25 years, which is related to the increase in the speed of growth of the variance of the yearly citations shown in figure 6.]
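The use of the summation relation to obtain one exponent from the other three can be written as a short helper. The sketch takes the relation in the form $H = J + L + M - 1$, which is an assumption here in the sense that it is the form reported in the cited references; the function names and the numerical values are hypothetical and are not the entries of table II.

```python
def hurst_from_exponents(J, L, M):
    """Hurst exponent from the Joseph, Latent and Moses exponents,
    assuming the summation relation H = J + L + M - 1."""
    return J + L + M - 1.0

def missing_exponent(H, known_a, known_b):
    """Solve H = J + L + M - 1 for whichever of J, L, M is unknown,
    given H and the other two exponents."""
    return H + 1.0 - known_a - known_b

# Ordinary Brownian motion: J = L = M = 1/2 gives H = 1/2.
print(hurst_from_exponents(0.5, 0.5, 0.5))  # 0.5

# Hypothetical values (not from table II), used only to show the round trip.
H_example = hurst_from_exponents(J=0.8, L=0.6, M=0.3)
M_back = missing_exponent(H_example, 0.8, 0.6)
```

A consistency check of this kind, comparing a directly fitted $H$ with the value implied by the three other exponents, is what the comparison in table II amounts to.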

G. Weak Ergodicity Breaking
According to what we have discussed up to now, the citation time series are highly non-stationary, correlated, have large fluctuations, and consequently they are non-ergodic. In this sub-section, we quantify the weak ergodicity breaking by measuring the ratio between the ensemble-averaged TA-MSD $\langle \overline{\delta^2(\Delta)} \rangle$, obtained from equation (1), and the ensemble average $\langle (Y_i(T + \Delta) - Y_i(T))^2 \rangle$, as a function of the increment duration $\Delta$.
For such processes the ensemble and time averages of the MSD are not equivalent, $\langle Y^2(\Delta) \rangle \neq \langle \overline{\delta^2(\Delta)} \rangle$ [21]. Figure 12 shows that in the three fields of science that we study, the ratio $\langle \overline{\delta^2(\Delta)} \rangle / \langle Y^2(\Delta) \rangle$ grows with $\Delta$, so care should be taken when comparing the results of one type of measurement procedure with those of the other.
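The ratio between the two kinds of averages can be sketched as follows. This is an illustration under stated assumptions: the cumulative citations are stored as an array with one row per paper, and the ensemble MSD is measured from the first recorded year of each trajectory, which is a simplifying choice made here and may differ from the reference time used for figure 12. The function name and the synthetic trajectories are hypothetical.

```python
import numpy as np

def eb_ratio(Y, lag):
    """Ratio of the ensemble-averaged TA-MSD to the ensemble MSD at a given lag.

    Y: array (N, T) of cumulative values.
    TA-MSD: mean over time and papers of [Y(t + lag) - Y(t)]**2.
    Ensemble MSD (assumption): mean over papers of [Y(lag) - Y(0)]**2,
    measured from the first column of the array.
    """
    Y = np.asarray(Y, dtype=float)
    ta = ((Y[:, lag:] - Y[:, :-lag]) ** 2).mean(axis=1).mean()
    ens = ((Y[:, lag] - Y[:, 0]) ** 2).mean()
    return ta / ens

t = np.arange(1.0, 35.0)
# Stationary increments (Y = a * t): both averages agree, ratio = 1.
steady = np.outer([1.0, 2.0, 3.0], t)
# Increments that shrink in time (aging): the time average, which also
# samples the late, quiet years, falls below the ensemble average.
aging = np.sqrt(t)[None, :]

print(round(eb_ratio(steady, 5), 2), round(eb_ratio(aging, 5), 2))
```

A ratio that stays at 1 for all lags is the ergodic reference case; a ratio that depends on $\Delta$, as reported for the citation data, signals that the two measurement procedures are not interchangeable.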

III. DISCUSSION
In this paper, the aging process in the time series of citations to scientific papers has been investigated, by considering first the total ensemble of publications, and then by separating highly cited papers from less popular ones. We found that the anomalous diffusion of the citation trajectories is a result of the non-stationarity of the yearly citation distribution, as well as of large fluctuations and temporal correlations. Here we see all three effects leading to anomalous diffusion, combined.
• Citation time series for the three analyzed fields are highly correlated. The correlation is measured using the time-averaged mean squared displacement and quantified by the Joseph exponent.
• Citation trajectories are highly non-stationary. This effect has been quantified with a well-defined Moses exponent ($M$); for a Gaussian stationary process $M$ equals 0.5. When $M$ deviates from 0.5 towards higher values, it signifies growth in the process (papers get more attention in time). This effect has been observed in popular papers with more than 900, 400 and 150 citations in physics, social science and technology, respectively. In a process which ages and dies out in time, the Moses exponent $M$ is smaller than 0.5. This property has been detected for the total publications in the fields, because of the high number of low-cited papers and their dominance in the statistics. For low cited publications, $M$ is close to zero.
• Citation trajectories show a strong Noah effect, with a Latent exponent above 0.8. This effect suggests a fat-tailed distribution of yearly citations, which has not been observed in the tail of the distribution because of the sparse number of highly cited papers.
All three factors above lead to anomalous diffusion of the citation trajectories. The relation between the three scaling exponents and the Hurst exponent holds well for the total publications as well as for the highly cited ones. For future research, it will be interesting to know how the presence of online archives such as Google Scholar, as well as new ways of giving exposure to papers, might affect the behavior of citation time series.